Given the overwhelming quantity of messages posted in social networks, it is imperative to filter out irrelevant information in order to make their use more productive.
This work focuses on the automatic classification of public social data according to its potential relevance to a general audience, judged by journalistic criteria. This means filtering out information that is private, personal, unimportant or simply irrelevant to the public, improving the overall quality of social media information.
A range of natural language processing toolkits was first assessed on a set of standard tasks, using popular datasets that cover newspaper and social network text. After that, different learning models were tested, using linguistic features extracted by some of those toolkits. The prediction of journalistic criteria, key in the assessment of relevance, was also explored using the same features. A new classifier uses the journalistic predictions, made by an ensemble of linguistic classifiers, as features to detect relevance. The resulting model achieved an F1 score of 0.82 with an area under the curve (AUC) of 0.78.
Relevance Mining and Detection System
1. Classification of Social Media Posts
according to their Relevance
Author:
Alexandre Pinto
Advisors:
Prof. Dr. Hugo Gonçalo Oliveira
Prof. Dr. Ana Oliveira Alves
Faculty of Sciences and Technology
Department of Informatics Engineering
University of Coimbra
September 9, 2016
2. Summary
Contents
1 Introduction
2 Objectives
3 Benchmarking NLP Toolkits
4 Relevance Detection
5 Conclusions
Classification of Social Media Posts, according to their Relevance Alexandre Pinto 2/60
5. Introduction
REMINDS
REMINDS = RElevance MINing and Detection System
Main Goal:
• Development of a system capable of detecting relevant
information, according to journalistic criteria, published in
social networks while ignoring irrelevant information such
as private comments and personal information, or public text
that is not important.
Four Main Approaches/Four Different Teams:
• Text Mining
• Sentiment Analysis
• Interaction Patterns and Network Topologies
• Natural Language Processing (NLP)
7. Introduction
What is Relevance?
Definition:
• “The degree to which something is related or useful to what
is happening or being talked about”
Human notion:
• Hard to measure and define
Ambiguous nature:
• Cannot simply search for it (Information Retrieval)
• Must instead filter out irrelevant content (Classification)
11. Objectives
Goal
• Automatic classification of public social data according to
its potential relevance to a general audience, filtering out
irrelevant information.
• Rely primarily on linguistic features, extracted with the help
of existing NLP tools
• Confirm if relevance can be predicted from a set of
journalistic criteria.
14. Objectives
The Big Picture
Figure: System Overview
Figure: System Overview (Focus of this work)
17. Why Benchmarking NLP Toolkits
[Pinto et al., 2016]
For widely-spoken languages, such as English:
• Wide range of NLP toolkits
• Complex applications do not have to be developed from
scratch
• Difficult choice among the available tools
18. Why Benchmarking NLP Toolkits
User Choices
Aspects to consider:
• Community of users
• Frequency of new versions and updates
• Cost of integration
• Programming language
• Covered tasks
• Performance (with formal and social media text)
25. Benchmarking NLP Toolkits
Workplan
Methodology:
• Choose a range of NLP toolkits
• Use of default configurations (pre-trained models)
• Perform a set of standard tasks
• Use of popular datasets that cover newspaper and social
network text
• Analyse results
31. Addressed Tasks
Lower-level NLP Tasks
Figure: Tokenization [1]
Figure: Part-of-Speech (POS) Tagging [1]
Figure: Chunking [1]
Figure: Named Entity Recognition/Classification [2]
[1] www.nltk.org/book/ch07.html
[2] stanfordnlp.github.io/CoreNLP
36. Used Datasets
• Public datasets used to assess the performance of NLP tools and
thus inform the choice among them
• Well-known and widely used in text classification
research, such as training and evaluating new tools
• Different gold standard datasets that cover different kinds
of text – newspaper and social media
39. Used Datasets
Newspaper and Social Media
CoNLL-2003 shared task data
• Collection of newswire articles from the Reuters Corpus (PoS, Chunk, NER)
Alan Ritter Twitter dataset
• Collection of randomly sampled tweets (PoS, Chunk, NER)
MSM 2013 workshop
• Collection of randomly sampled tweets (NER)
Format
Token     POS  Syntactic Chunk  Named Entity
Only      RB   B-NP             O
France    NNP  I-NP             LOC
and       CC   I-NP             O
Britain   NNP  I-NP             LOC
backed    VBD  B-VP             O
Fischler  NNP  B-NP             PER
’s        POS  B-NP             O
proposal  NN   I-NP             O
.         .    O                O
Table: Example of the Annotated Data Format
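A minimal sketch of how the four-column format above could be read, assuming whitespace-separated fields in the order token, POS, chunk, named entity. The function names `parse_rows`, `group_chunks` and `named_entities` are illustrative and not part of any toolkit mentioned in this work.

```python
# The sample rows are copied from the annotated-format table.
SAMPLE = """\
Only RB B-NP O
France NNP I-NP LOC
and CC I-NP O
Britain NNP I-NP LOC
backed VBD B-VP O
Fischler NNP B-NP PER
’s POS B-NP O
proposal NN I-NP O
. . O O
"""


def parse_rows(text):
    """Split each line into a (token, pos, chunk, entity) tuple."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 4:  # skip blank or malformed lines
            rows.append(tuple(fields))
    return rows


def group_chunks(rows):
    """Group tokens into chunks: B-X opens a chunk, I-X extends it, O closes it."""
    chunks = []
    open_type = None  # type of the chunk currently being built, if any
    for token, _pos, chunk_tag, _entity in rows:
        if chunk_tag == "O":
            open_type = None
            continue
        prefix, chunk_type = chunk_tag.split("-", 1)
        if prefix == "I" and open_type == chunk_type:
            chunks[-1][1].append(token)
        else:
            chunks.append((chunk_type, [token]))
            open_type = chunk_type
    return chunks


def named_entities(rows):
    """Return (token, entity_type) pairs for tokens tagged with an entity."""
    return [(token, entity) for token, _pos, _chunk, entity in rows if entity != "O"]


rows = parse_rows(SAMPLE)
chunks = group_chunks(rows)
entities = named_entities(rows)
```

On the sample, this groups "Only France and Britain" into one noun phrase and lists France, Britain and Fischler as entities.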
43. Used Datasets
Newspaper and Social Media
PoS:
• Penn Treebank style
(CoNLL2003)
• PTB + Twitter-specific tags (@usernames, #hashtags and URLs) (Ritter)
Chunking Format:
• IOB-TYPE format
Named Entities:
• PER, LOC, ORG or MISC
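To illustrate why Twitter-specific tags matter, here is a small regex-based sketch that keeps @usernames, #hashtags and URLs as single tokens. It is not the tokenizer of any benchmarked toolkit, and the labels `USR`, `HT` and `URL` are used here only as illustrative tag names.

```python
import re

# Alternatives are tried left to right, so URLs and @/# tokens win over plain words.
TOKEN_RE = re.compile(
    r"""
    https?://\S+      # URLs
    | [@#]\w+         # @usernames and #hashtags
    | \w+             # plain words and numbers
    | [^\w\s]         # any other single symbol
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the tokens of a message, keeping Twitter-specific items whole."""
    return TOKEN_RE.findall(text)


def twitter_tag(token):
    """Return an illustrative Twitter-specific tag, or None for ordinary tokens."""
    if token.startswith(("http://", "https://")):
        return "URL"
    if token.startswith("@") and len(token) > 1:
        return "USR"
    if token.startswith("#") and len(token) > 1:
        return "HT"
    return None


tokens = tokenize("RT @bbc: Refugees arrive http://t.co/abc #Syria")
tags = [twitter_tag(t) for t in tokens]
```

A tokenizer that splits on punctuation alone would break "@bbc" and the URL into several pieces, which is one reason tools built for newswire text lose accuracy on tweets.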
46. Used Datasets
Statistics
Dataset                 Documents  Tokens  Average Tokens per Document
CoNLL (Reuters Corpus)  946        203621  215
Twitter (Alan Ritter)   2394       46469   19
#MSM2013                2815       52124   19
Table: Dataset properties
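The last column can be reproduced from the first two. The sketch below assumes the average is the token count divided by the document count, rounded to the nearest integer.

```python
# (documents, tokens) as listed in the dataset properties table
DATASETS = {
    "CoNLL (Reuters Corpus)": (946, 203621),
    "Twitter (Alan Ritter)": (2394, 46469),
    "#MSM2013": (2815, 52124),
}


def average_tokens(documents, tokens):
    """Average number of tokens per document, rounded to an integer."""
    return round(tokens / documents)


averages = {name: average_tokens(d, t) for name, (d, t) in DATASETS.items()}
```

The newswire documents are roughly ten times longer than the tweets, which gives the social media tools far less context per document.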
48. Compared Tools
Standard vs Social NLP toolkits
Standard NLP toolkits:
• NLTK
• Apache OpenNLP
• Stanford CoreNLP
• Pattern
Social Network-Oriented Toolkits:
• TwitterNLP
• TweetNLP
• TwitIE
50. Compared Tools
Tools Summary
System      Programming Language  Target Text   Tokenization  PoS tagging  Chunking  NER
NLTK        Python                Generic
OpenNLP     Java                  Generic
CoreNLP     Java                  Generic
Pattern     Python                Generic
TweetNLP    Java                  Social Media
TwitterNLP  Python                Social Media
TwitIE      Java                  Social Media
Table: Toolkit properties
52. Comparison Results
            CoNLL                                    Alan Ritter - Twitter
Tool        PoS          Chunking     NEC          PoS          Chunking     NEC
OpenNLP     0.88 ± 0.10  0.83 ± 0.12  0.87 ± 0.09  0.71 ± 0.17  0.45 ± 0.39  0.87 ± 0.13
TweetNLP    0.84 ± 0.09  n/a          n/a          0.95 ± 0.07  n/a          n/a
TwitterNLP  0.83 ± 0.15  0.83 ± 0.13  0.85 ± 0.12  0.92 ± 0.11  0.90 ± 0.11  0.95 ± 0.08
Table: Best Performance Results (metric: F1 ± σ)
53. Comparison Results
Discussion
• Common NLP tools usually have good performance on
well-formed content, such as news
• Noisy and informal text, such as tweets, brings new
challenges, decreasing the performance
• Specially tailored tools such as CMU TweetNLP and
TwitterNLP perform well on social media text and were used in the
feature extraction process.
• General purpose tools offer better support and are more
customizable (accept new trained models)
58. Relevance Detection
Methods used in this work
Definition of Relevance
• Criteria [3]: Controversialness, Informativeness, Meaningfulness, Novelty, Reliability, Scope
NLP Tasks
• Extraction: Part-of-Speech, Chunking, Named Entities, Polarity of words, LDA topics, N-gram, Stemming, Lemmatization
Machine Learning Methods
• Preprocessing: Standardization, Normalization, Scaling
• Selection: Info. Gain, Gain Ratio, Fisher, Pearson, Chi-square
• Reduction: PCA
• Models: MDC, kNN, NB, SVM, DT, RF
• Evaluation: Accuracy, Precision, Recall, F1, ROC, AP, k-Fold-CV
Table: Methods used in this work
[3] Journalistic criteria established by CRACS@INESC-TEC
59. Related Work
Author(s)         Target Class         Source Data  Classifier  Performance
Mohammad et al.   Sentiment            Twitter      SVM         F1=0.69
Sriram            News type            Twitter      SVM         Acc=0.96
Fernandes et al.  Popularity           Mashable     RF          F1=0.69
Guerini et al.    Buzz                 Digg         SVM         F1=0.81
Zeng et al.       Helpful Opinion      Amazon       SVM         Acc=0.72
Irani et al.      Trending Content     Twitter      C4.5        F1=0.79
Lee et al.        Trending Categories  Twitter      NB          Acc=0.65
Frain et al.      Satiric Content      Created      SVM         F1=0.89
Liparas et al.    Topic                News Sites   RF          F1=0.85
Feature Groups: word n-grams, char n-grams, all-caps, POS, #hashtags, punctuation, emoticons, elongated words, clusters, authorship info., digital media, #words, #links, length, LDA topics, polarity, lemmas, TF-IDF, #profanity
Table: Related Work
61. Relevance Detection
Used Datasets
• Textual messages gathered (by CRACS@INESC-TEC) from
Twitter and Facebook
• Text quality preferred over text quantity
63. Relevance Detection
Used Datasets
Twitter search queries:
• “refugees” and “Syria”
• “elections” and “US”
• “Olympic Games”
• “terrorism”
• “Daesh”
Official Facebook pages:
• Euronews, CNN, Washington Post, Financial Times, New
York Post, The New York Times, BBC News, The Telegraph,
The Guardian, The Huffington Post, Der Spiegel
International, Deutsche Welle News, Pravda and Fox News.
65. Relevance Detection
Used Datasets
• The same method was used with other journalistic criteria,
such as: interestingness, controversy, meaningfulness,
novelty, reliability and scope.
66. Relevance Detection
Used Datasets
                            #Facebook Posts       #Facebook Comments    #Tweets
Search Word                 Relevant  Irrelevant  Relevant  Irrelevant  Relevant  Irrelevant
“Refugees” + “Syria”        20        4           30        13          55        23
“Elections” + “US”          21        8           21        14          29        39
“Olympic Games”             2         0           4         1           22        114
“Terrorism”                 53        16          138       88          59        53
“Daesh”                     2         0           14        12          26        30
“Referendum” + “UK” + “EU”  4         0           7         1           14        4
Table: Documents grouped by source, relevance label and query
67. Relevance Detection
Used Datasets
• “Putin: Turkey supports terrorism and stabs Russia in the back”
Source: FB post. Answers (A1, A2, A3): 5, 4, 5. Class: Relevant
• “Canada to accept additional 10,000 Syrian refugees”
Source: Tweet. Answers (A1, A2, A3): 4, 5, 5. Class: Relevant
• “Lololol winning the internet and stomping out daesh #merica”
Source: Tweet. Answers (A1, A2, A3): 1, 1, 1. Class: Irrelevant
• “Comparing numbers of people killed by terrorism with numbers killed by slipping in bath tub is stupid as eff. It totally ignores the mal-intent behind terrorism, its impact on way of life and ideology.”
Source: FB comment. Answers (A1, A2, A3): 2, 4, 3. Class: Irrelevant
Table: Examples of messages in the dataset.
69. Feature Extraction
Feature Set
Feature Set                                       #Distinct Features
PoS-tags                                          54
Chunk tags                                        23
NE tags                                           11
Total number of PoS/Chunk tags                    2
Total number of Named Entities                    1
Total number of positive/neutral/negative words   3
Total number of characters/tokens                 2
Total number/proportion of all capitalized words  2
LDA topic distribution                            20
Token 1-3grams                                    2711 (f ≥ 3)
Lemma 1-5grams                                    top-750 (f ≥ 1)
Stem 1-5grams                                     top-750 (f ≥ 1)
PoS 1-5grams                                      top-125 (f ≥ 1)
Chunk 1-5grams                                    top-125 (f ≥ 1)
Total                                             4,579
Table: Feature sets used
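A rough sketch of how two of the feature groups above could be computed from tokenized messages: the surface counts (characters, tokens, all-capitalized words) and an n-gram vocabulary with a minimum frequency f. The function names are hypothetical, and counting characters over tokens only (without whitespace) is an assumption, not necessarily the definition used in this work.

```python
from collections import Counter


def surface_features(tokens):
    """Character, token and all-caps counts for one tokenized message."""
    all_caps = [t for t in tokens if len(t) > 1 and t.isalpha() and t.isupper()]
    return {
        "n_chars": sum(len(t) for t in tokens),  # assumption: whitespace excluded
        "n_tokens": len(tokens),
        "n_all_caps": len(all_caps),
        "all_caps_ratio": len(all_caps) / len(tokens) if tokens else 0.0,
    }


def ngrams(tokens, n_min, n_max):
    """Yield every n-gram of length n_min..n_max as a tuple of tokens."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def ngram_vocabulary(docs, n_min, n_max, min_freq):
    """Keep the n-grams that occur at least min_freq times in the corpus."""
    counts = Counter()
    for tokens in docs:
        counts.update(ngrams(tokens, n_min, n_max))
    return {gram for gram, count in counts.items() if count >= min_freq}


docs = [
    ["BREAKING", "news", "from", "Syria"],
    ["news", "from", "Syria", "today"],
]
first = surface_features(docs[0])
vocab = ngram_vocabulary(docs, n_min=1, n_max=3, min_freq=2)
```

The same `ngram_vocabulary` helper would apply to lemma, stem, PoS or chunk sequences by passing those sequences instead of raw tokens.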
71. Baseline Experiments
Feature sets:
• Full feature set
• Part-of-speech
• Chunks
• Named entities
• Chars+Tokens+Allcaps+Allcaps-ratio
• Positive+Neutral+Negative
• LDA topic distribution
• Token n-grams
• Lemma n-grams
• Stem n-grams
• PoS n-grams
• Chunk n-grams
72. Baseline Experiments
Classifiers:
• Minimum Distance Classifier
• k-Nearest Neighbors
• Naive Bayes
• Support Vector Machine
• Decision Tree
• Random Forest Classifier
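The Minimum Distance Classifier is the simplest model in this list, so it is sketched here in plain Python: each class is represented by the mean of its training vectors, and a new vector receives the label of the closest mean. Euclidean distance and the toy data are assumptions made for illustration, not the configuration used in the experiments.

```python
import math


class MinimumDistanceClassifier:
    """Nearest-centroid classifier using Euclidean distance."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for j, value in enumerate(row):
                acc[j] += value
            counts[label] = counts.get(label, 0) + 1
        # One centroid (mean vector) per class
        self.centroids_ = {
            label: [s / counts[label] for s in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, X):
        return [
            min(self.centroids_, key=lambda label: math.dist(row, self.centroids_[label]))
            for row in X
        ]


# Toy two-dimensional feature vectors, purely illustrative
X_train = [[0, 0], [0, 1], [5, 5], [6, 5]]
y_train = ["irrelevant", "irrelevant", "relevant", "relevant"]

model = MinimumDistanceClassifier().fit(X_train, y_train)
predictions = model.predict([[1, 1], [5, 4]])
```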
73. Baseline Experiments
Performance Metrics:
• Accuracy
• Precision
• Recall
• F1
• Area Under the Curve (AUC)
• Average Precision (AP)
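For reference, the sketch below computes most of these metrics for a binary task where 1 means relevant and 0 means irrelevant. AUC is computed through its pairwise interpretation (the probability that a random relevant item is scored above a random irrelevant one). The labels and scores are made up for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.1]

metrics = binary_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

F1 depends on a fixed decision threshold, while AUC is computed from the ranking of scores alone, so the two can disagree about which classifier is best.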
75. Baseline Experiments
Best ROC Curves
Figure: ROC Curves of an SVM Classifier using PoS tags as features
76. Baseline Experiments
Best PR Curves
Figure: Precision-Recall Curves of a Minimum Distance Classifier using
LDA topic distributions as features
78. Feature Engineering
• Number of used features: 201
Preprocessing methods:
• Standardization / Normalization / Scaling
Feature Selection/Reduction methods:
• Information Gain/Gain Ratio
• Chi-square (χ2) / Fisher score / Pearson Correlation
• PCA (4 dimensions)
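Two of the steps above, standardization and the Pearson correlation filter, are sketched here in plain Python. Ranking features by the absolute correlation with the class label and keeping the top k is one common way to use Pearson correlation as a filter; whether this matches the exact procedure of this work is an assumption, and `select_top_k` is a hypothetical helper name.

```python
import math


def standardize(column):
    """Rescale a feature column to zero mean and unit variance."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std if std else 0.0 for v in column]


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def select_top_k(X, y, k):
    """Indices of the k features most correlated (in absolute value) with y."""
    columns = list(zip(*X))
    ranked = sorted(
        range(len(columns)),
        key=lambda j: abs(pearson(columns[j], y)),
        reverse=True,
    )
    return sorted(ranked[:k])


# Toy data: feature 0 tracks the label, feature 1 is constant, feature 2 is noise
X = [[1, 5, 0], [2, 5, 1], [3, 5, 0], [4, 5, 1]]
y = [0, 0, 1, 1]

z = standardize([row[0] for row in X])
kept = select_top_k(X, y, k=1)
```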
82. Feature Engineering
Best ROC Curves
Figure: ROC Curves of a kNN Classifier using Standardization and the
Pearson Correlation Filter
83. Feature Engineering
Best PR Curves
Figure: Precision-Recall Curves of a Naive Bayes Classifier using
Standardization and the Pearson Correlation Filter
85. Predicting Relevance through Journalistic Criteria
Overview
Figure: Prediction of Relevance using Journalistic Criteria
86. Predicting Relevance through Journalistic Criteria
Results
Relevance based on Journalistic Criteria
Intermediate Classifiers      Accuracy     Precision    Recall       F1           AP           AUC
Minimum Distance Classifiers  0.62 ± 0.11  0.66 ± 0.17  0.89 ± 0.19  0.72 ± 0.08  0.80 ± 0.06  0.59 ± 0.13
K-Nearest Neighbors           0.54 ± 0.08  0.63 ± 0.14  0.57 ± 0.17  0.57 ± 0.08  0.72 ± 0.05  0.54 ± 0.09
Naive Bayes                   0.56 ± 0.01  0.56 ± 0.01  0.97 ± 0.03  0.71 ± 0.01  0.77 ± 0.01  0.52 ± 0.01
Linear SVMs                   0.54 ± 0.03  0.56 ± 0.02  0.89 ± 0.08  0.68 ± 0.03  0.75 ± 0.02  0.50 ± 0.04
Decision Trees                0.55 ± 0.05  0.57 ± 0.03  0.78 ± 0.11  0.65 ± 0.05  0.73 ± 0.03  0.52 ± 0.05
Random Forests                0.79 ± 0.07  0.80 ± 0.08  0.84 ± 0.07  0.82 ± 0.06  0.86 ± 0.04  0.78 ± 0.08
Table: Results on Predicting Relevance by an Ensemble of Journalistic Classifiers
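The two-level idea can be sketched as follows: one intermediate classifier per journalistic criterion is trained on the linguistic features, and their predictions become the feature vector of a final relevance classifier. To stay self-contained, this sketch uses a nearest-centroid model at both levels as a stand-in; it is not the Random Forest and kNN combination reported in the slides, and the data and criterion names are toy values.

```python
import math


def fit_centroids(X, y):
    """Stand-in model: one mean vector per class label."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }


def predict_centroid(centroids, row):
    """Label of the closest centroid (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(row, centroids[label]))


def criterion_vector(intermediate, row):
    """Predictions of every intermediate classifier, in a fixed (sorted) order."""
    return [float(predict_centroid(model, row)) for _name, model in sorted(intermediate.items())]


def fit_stack(X, criteria_labels, relevance):
    """Train one model per criterion, then a final model on their predictions."""
    intermediate = {name: fit_centroids(X, labels) for name, labels in criteria_labels.items()}
    meta_X = [criterion_vector(intermediate, row) for row in X]
    final = fit_centroids(meta_X, relevance)
    return intermediate, final


def predict_stack(intermediate, final, row):
    return predict_centroid(final, criterion_vector(intermediate, row))


# Toy linguistic feature vectors with per-criterion and relevance labels (1 = yes)
X = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.0, 0.0], [0.0, 1.0]]
criteria = {
    "novelty": [0, 0, 1, 1, 1, 0],
    "scope": [0, 0, 1, 1, 0, 1],
}
relevance = [0, 0, 1, 1, 0, 0]

intermediate, final = fit_stack(X, criteria, relevance)
results = [predict_stack(intermediate, final, row) for row in [[0.95, 0.9], [0.1, 0.05], [1.0, 0.1]]]
```

For brevity the final model is trained on in-sample intermediate predictions; in a real experiment, out-of-fold predictions would usually be used to avoid an optimistic bias.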
87. Predicting Relevance through Journalistic Criteria
Best ROC Curves
Figure: ROC Curves of a Journalistic Based kNN Classifier, using
Random Forests for the intermediate classifiers
88. Predicting Relevance through Journalistic Criteria
Best PR Curves
Figure: Precision-Recall Curves of a Journalistic Based kNN Classifier,
using Random Forests for the intermediate classifiers
90. Conclusion
Final Remarks
• Under the scope of the REMINDS project, a classifier was
created using exclusively linguistic features.
• Feature engineering leads to slightly better results, but the
baseline experiments are still competitive
• Future integration with other filters which consider other
features.
93. Conclusion
Final Remarks
• The best approach uses an ensemble of classifiers, each targeting
one of the journalistic criteria with linguistic features
extracted from text, achieving an F1 score of 0.82 and an AUC
of 0.78.
• Results are in line with state-of-the-art results that follow
similar approaches but classify documents according to
other criteria.
96. References
[Pinto et al., 2016] A. Pinto, H. Gonçalo Oliveira, and A. Oliveira Alves.
Comparing the Performance of Different NLP Toolkits in Formal and Social Media Text.
In Marjan Mernik, José Paulo Leal, and Hugo Gonçalo Oliveira, editors, 5th Symposium on Languages, Applications and Technologies (SLATE’16), volume 51 of OpenAccess Series in Informatics (OASIcs), pages 1–16, Dagstuhl, Germany, 2016. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
ISBN 978-3-95977-006-4.
doi: http://dx.doi.org/10.4230/OASIcs.SLATE.2016.3.
URL http://drops.dagstuhl.de/opus/volltexte/2016/6008.