Classification of Social Media Posts
according to their Relevance
Author:
Alexandre Pinto
Advisors:
Prof. Dr. Hugo Gonçalo Oliveira
Prof. Dr. Ana Oliveira Alves
Faculty of Sciences and Technology
Department of Informatics Engineering
University of Coimbra
September 9, 2016
Summary
Contents
1 Introduction
2 Objectives
3 Benchmarking NLP Toolkits
4 Relevance Detection
5 Conclusions
Classification of Social Media Posts, according to their Relevance Alexandre Pinto 2/60
Introduction
REMINDS
REMINDS = RElevance MINing and Detection System
Main Goal:
• Development of a system capable of detecting relevant
information, according to journalistic criteria, published in
social networks while ignoring irrelevant information such
as private comments and personal information, or public text
that is not important.
Four Main Approaches/Four Different Teams:
• Text Mining
• Sentiment Analysis
• Interaction Patterns and Network Topologies
• Natural Language Processing (NLP)
What is Relevance?
Definition:
• “The degree to which something is related or useful to what
is happening or being talked about”
Human notion:
• Hard to measure and define
Ambiguous nature:
• Cannot simply search for it (Information Retrieval)
• Must instead filter out irrelevant content (Classification)
Objectives
Goal
• Automatic classification of public social data according to
its potential relevance to a general audience, filtering out
irrelevant information.
• Rely primarily on linguistic features, extracted with the help
of existing NLP tools.
• Confirm whether relevance can be predicted from a set of
journalistic criteria.
The Big Picture
Figure: System Overview
Figure: System Overview (Focus of this work)
Benchmarking NLP Toolkits
Why Benchmarking NLP Toolkits
[Pinto et al., 2016]
For widely-spoken languages, such as English:
• Wide range of NLP toolkits
• Complex applications do not have to be developed from
scratch
• Difficult choice among the available tools
User Choices
Aspects to consider:
• Community of users
• Frequency of new versions and updates
• Cost of integration
• Programming language
• Covered tasks
• Performance (with formal and social media text)
Workplan
Methodology:
• Choose a range of NLP toolkits
• Use of default configurations (pre-trained models)
• Perform a set of standard tasks
• Use of popular datasets that cover newspaper and social
network text
• Analyse results
Addressed Tasks
Lower-level NLP Tasks
Figure: Tokenization [1]
Figure: Part-of-Speech (PoS) Tagging [1]
Figure: Chunking [1]
Figure: Named Entity Recognition/Classification [2]
[1] www.nltk.org/book/ch07.html
[2] stanfordnlp.github.io/CoreNLP
Used Datasets
• Public datasets, used to assess the performance of the NLP
tools and to support the choice among them
• Well-known and widely used in text classification
research, e.g. for training and evaluating new tools
• Different gold standard datasets that cover different kinds
of text: newspaper and social media
Newspaper and Social Media
CoNLL-2003 shared task data
• Collection of newswire articles from the Reuters Corpus (PoS, Chunk, NER)
Alan Ritter Twitter dataset
• Collection of randomly sampled tweets (PoS, Chunk, NER)
MSM 2013 workshop
• Collection of randomly sampled tweets (NER)
Format
Token     POS  Syntactic Chunk  Named Entity
Only      RB   B-NP             O
France    NNP  I-NP             LOC
and       CC   I-NP             O
Britain   NNP  I-NP             LOC
backed    VBD  B-VP             O
Fischler  NNP  B-NP             PER
’s        POS  B-NP             O
proposal  NN   I-NP             O
.         .    O                O
Table: Example of the Annotated Data Format
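The annotated format above can be read with a few lines of code. The following is a minimal sketch, assuming whitespace-separated columns in the order shown (token, PoS, chunk, named entity) and a blank line between sentences, as in the CoNLL-2003 files. The names `AnnotatedToken`, `parse_sentences` and `SAMPLE` are hypothetical and only serve the illustration.

```python
from typing import List, NamedTuple


class AnnotatedToken(NamedTuple):
    """One row of the four-column annotated format (hypothetical helper)."""
    token: str
    pos: str
    chunk: str
    entity: str


# The example sentence from the slide, one token per line.
SAMPLE = """\
Only RB B-NP O
France NNP I-NP LOC
and CC I-NP O
Britain NNP I-NP LOC
backed VBD B-VP O
Fischler NNP B-NP PER
's POS B-NP O
proposal NN I-NP O
. . O O
"""


def parse_sentences(text: str) -> List[List[AnnotatedToken]]:
    """Split the text into sentences; a blank line ends a sentence."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        token, pos, chunk, entity = line.split()
        current.append(AnnotatedToken(token, pos, chunk, entity))
    if current:
        sentences.append(current)
    return sentences


sentences = parse_sentences(SAMPLE)
# Keep only the tokens that carry a named-entity label other than "O".
entities = [(t.token, t.entity) for t in sentences[0] if t.entity != "O"]
print(entities)  # [('France', 'LOC'), ('Britain', 'LOC'), ('Fischler', 'PER')]
```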
PoS:
• Penn Treebank (PTB) style (CoNLL-2003)
• PTB + Twitter-specific tags (@usernames, #hashtags, and URLs) (Ritter)
Chunking Format:
• IOB-TYPE format
Named Entities:
• PER, LOC, ORG or MISC
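To make the IOB-TYPE chunk format concrete, the sketch below groups tagged tokens into chunks. It assumes that a `B-` tag always opens a new chunk, that an `I-` tag continues the current chunk of the same type, and that `O` marks tokens outside any chunk. The function name `iob_to_chunks` is hypothetical, and the tags are those of the example sentence in the annotated data format.

```python
def iob_to_chunks(tokens, tags):
    """Group tokens into (chunk_type, text) pairs from IOB-TYPE tags."""
    chunks = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, chunk_type = tag.partition("-")
        # An I- tag whose type differs from the open chunk also starts a new one.
        starts_new = prefix == "B" or (prefix == "I" and chunk_type != current_type)
        if prefix == "O" or starts_new:
            if current_tokens:
                chunks.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if prefix in ("B", "I"):
            current_type = chunk_type
            current_tokens.append(token)
    if current_tokens:
        chunks.append((current_type, " ".join(current_tokens)))
    return chunks


tokens = ["Only", "France", "and", "Britain", "backed", "Fischler", "'s", "proposal", "."]
tags = ["B-NP", "I-NP", "I-NP", "I-NP", "B-VP", "B-NP", "B-NP", "I-NP", "O"]
print(iob_to_chunks(tokens, tags))
# [('NP', 'Only France and Britain'), ('VP', 'backed'), ('NP', 'Fischler'), ('NP', "'s proposal")]
```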
Used Datasets
Statistics
Dataset                  Documents  Tokens  Average Tokens per Document
CoNLL (Reuters Corpus)   946        203621  215
Twitter (Alan Ritter)    2394       46469   19
#MSM2013                 2815       52124   19
Table: Dataset properties
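The last column of the table follows from the other two: it is the token count divided by the document count, rounded to the nearest integer. A short check, using only the figures in the table (the variable names are illustrative):

```python
# (documents, tokens) per dataset, copied from the table above.
datasets = {
    "CoNLL (Reuters Corpus)": (946, 203621),
    "Twitter (Alan Ritter)": (2394, 46469),
    "#MSM2013": (2815, 52124),
}

# Average tokens per document, rounded to the nearest integer.
averages = {name: round(tokens / docs) for name, (docs, tokens) in datasets.items()}

for name, avg in averages.items():
    print(f"{name}: {avg}")
```

The news articles are roughly an order of magnitude longer than the tweets, which is worth keeping in mind when comparing per-document scores across datasets.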
Compared Tools
Standard vs Social NLP toolkits
Standard NLP toolkits:
• NLTK
• Apache OpenNLP
• Stanford CoreNLP
• Pattern
Social Network-Oriented Toolkits:
• TwitterNLP
• TweetNLP
• TwitIE
Tools Summary
System      Programming Language  Target Text
NLTK        Python                Generic
OpenNLP     Java                  Generic
CoreNLP     Java                  Generic
Pattern     Python                Generic
TweetNLP    Java                  Social Media
TwitterNLP  Python                Social Media
TwitIE      Java                  Social Media
(Per-tool support marks for the Tokenization, PoS tagging, Chunking and NER columns are not preserved.)
Table: Toolkit properties
Comparison Results
All values are F1 ± σ.
            CoNLL                                  Alan Ritter - Twitter
Tool        PoS          Chunking     NEC          PoS          Chunking     NEC
OpenNLP     0.88 ± 0.10  0.83 ± 0.12  0.87 ± 0.09  0.71 ± 0.17  0.45 ± 0.39  0.87 ± 0.13
TweetNLP    0.84 ± 0.09  n/a          n/a          0.95 ± 0.07  n/a          n/a
TwitterNLP  0.83 ± 0.15  0.83 ± 0.13  0.85 ± 0.12  0.92 ± 0.11  0.90 ± 0.11  0.95 ± 0.08
Table: Best Performance Results
Discussion
• Common NLP tools usually perform well on well-formed
content, such as news
• Noisy and informal text, such as tweets, brings new
challenges and decreases performance
• Specially tailored tools, such as CMU TweetNLP and
TwitterNLP, perform well on social media text and were used
in the feature extraction process
• General-purpose tools offer better support and are more
customizable (they accept newly trained models)
Relevance Detection
Methods used in this work
Definition of Relevance
• Criteria [3]: Controversialness, Informativeness, Meaningfulness, Novelty, Reliability, Scope
NLP Tasks
• Extraction: Part-of-Speech, Chunking, Named Entities, Polarity of words, LDA topics, N-gram, Stemming, Lemmatization
Machine Learning Methods
• Preprocessing: Standardization, Normalization, Scaling
• Selection: Info. Gain, Gain Ratio, Fisher, Pearson, Chi-square
• Reduction: PCA
• Models: MDC, kNN, NB, SVM, DT, RF
• Evaluation: Accuracy, Precision, Recall, F1, ROC, AP, k-Fold-CV
Table: Methods used in this work
[3] Journalistic criteria established by CRACS@INESC-TEC
Related Work
Author(s)         Target Class         Source Data  Classifier  Performance
Mohammad et al.   Sentiment            Twitter      SVM         F1=0.69
Sriram            News type            Twitter      SVM         Acc=0.96
Fernandes et al.  Popularity           Mashable     RF          F1=0.69
Guerini et al.    Buzz                 Digg         SVM         F1=0.81
Zeng et al.       Helpful Opinion      Amazon       SVM         Acc=0.72
Irani et al.      Trending Content     Twitter      C4.5        F1=0.79
Lee et al.        Trending Categories  Twitter      NB          Acc=0.65
Frain et al.      Satiric Content      Created      SVM         F1=0.89
Liparas et al.    Topic                News Sites   RF          F1=0.85
Feature groups compared across these works (per-work marks are not preserved):
word n-grams, char n-grams, all-caps, POS, #hashtags, punctuation, emoticons,
elongated words, clusters, authorship info., digital media, #words, #links,
length, LDA topics, polarity, lemmas, TF-IDF, #profanity
Table: Related Work
Used Datasets
• Textual messages gathered (by CRACS@INESC-TEC) from
Twitter and Facebook
• Text quality preferred over text quantity
Twitter search queries:
• “refugees” and “Syria”
• “elections” and “US”
• “Olympic Games”
• “terrorism”
• “Daesh”
Official Facebook pages:
• Euronews, CNN, Washington Post, Financial Times, New
York Post, The New York Times, BBC News, The Telegraph,
The Guardian, The Huffington Post, Der Spiegel
International, Deutsche Welle News, Pravda and Fox News.
• The same method was used with other journalistic criteria,
such as: interestingness, controversy, meaningfulness,
novelty, reliability and scope.
                             #Facebook Posts       #Facebook Comments    #Tweets
Search Word                  Relevant  Irrelevant  Relevant  Irrelevant  Relevant  Irrelevant
“Refugees” + “Syria”         20        4           30        13          55        23
“Elections” + “US”           21        8           21        14          29        39
“Olympic Games”              2         0           4         1           22        114
“Terrorism”                  53        16          138       88          59        53
“Daesh”                      2         0           14        12          26        30
“Referendum” + “UK” + “EU”   4         0           7         1           14        4
Table: Documents grouped by source, relevance label and query
Each entry lists the content, its source, the three annotator answers (A1, A2, A3) and the resulting class.
• “Putin: Turkey supports terrorism and stabs Russia in the back”
  Source: FB post. Answers: A1=5, A2=4, A3=5. Class: Relevant
• “Canada to accept additional 10,000 Syrian refugees”
  Source: Tweet. Answers: A1=4, A2=5, A3=5. Class: Relevant
• “Lololol winning the internet and stomping out daesh #merica”
  Source: Tweet. Answers: A1=1, A2=1, A3=1. Class: Irrelevant
• “Comparing numbers of people killed by terrorism with numbers killed by
  slipping in bath tub is stupid as eff. It totally ignores the mal-intent
  behind terrorism, its impact on way of life and ideology.”
  Source: FB comment. Answers: A1=2, A2=4, A3=3. Class: Irrelevant
Table: Examples of messages in the dataset.
Feature Extraction
Feature Set
Feature Set                                        #Distinct Features
PoS-tags                                           54
Chunk tags                                         23
NE tags                                            11
Total number of PoS/Chunk tags                     2
Total number of Named Entities                     1
Total number of positive/neutral/negative words    3
Total number of characters/tokens                  2
Total number/proportion of all capitalized words   2
LDA topic distribution                             20
Token 1-3grams                                     2711 (f ≥ 3)
Lemma 1-5grams                                     top-750 (f ≥ 1)
Stem 1-5grams                                      top-750 (f ≥ 1)
PoS 1-5grams                                       top-125 (f ≥ 1)
Chunk 1-5grams                                     top-125 (f ≥ 1)
Total                                              4,579
Table: Feature sets used
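The character, token and capitalization counts in the table are simple surface features. The sketch below shows one possible way to compute them. It assumes plain whitespace tokenization and treats a token as "all capitalized" when it is alphabetic, upper-case and longer than one character; both choices are assumptions, and the work itself relied on dedicated NLP toolkits for tokenization. The function name `surface_features` is hypothetical.

```python
def surface_features(text):
    """Count characters, tokens and all-caps tokens in a message."""
    tokens = text.split()  # assumption: whitespace tokenization
    all_caps = [t for t in tokens if t.isalpha() and t.isupper() and len(t) > 1]
    n_tokens = len(tokens)
    return {
        "n_chars": len(text),
        "n_tokens": n_tokens,
        "n_allcaps": len(all_caps),
        # Proportion of all-caps tokens; 0.0 for an empty message.
        "allcaps_ratio": len(all_caps) / n_tokens if n_tokens else 0.0,
    }


features = surface_features("BREAKING news from the UN today")
print(features)
```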
Baseline Experiments
Feature sets:
• Full feature set
• Part-of-speech
• Chunks
• Named entities
• Chars+Tokens+Allcaps+Allcaps-ratio
• Positive+Neutral+Negative
• LDA topic distribution
• Token n-grams
• Lemma n-grams
• Stem n-grams
• PoS n-grams
• Chunk n-grams
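The n-gram feature sets in the list above can be sketched as follows. This is a minimal illustration, assuming that the frequency threshold f counts total occurrences in the corpus and that "top-k" keeps the k most frequent n-grams; the original selection procedure may differ. The names `ngrams`, `build_vocabulary` and `vectorize` are hypothetical, and the same functions apply to tokens, lemmas, stems, PoS tags or chunk tags.

```python
from collections import Counter


def ngrams(items, n_min, n_max):
    """Yield every n-gram (as a tuple) with n_min <= n <= n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(items) - n + 1):
            yield tuple(items[i:i + n])


def build_vocabulary(docs, n_min=1, n_max=3, min_freq=3, top_k=None):
    """Keep n-grams seen at least min_freq times, optionally only the top_k."""
    counts = Counter()
    for items in docs:
        counts.update(ngrams(items, n_min, n_max))
    kept = [gram for gram, count in counts.most_common() if count >= min_freq]
    return kept if top_k is None else kept[:top_k]


def vectorize(items, vocabulary):
    """Count how often each vocabulary n-gram occurs in one document."""
    max_n = max((len(gram) for gram in vocabulary), default=0)
    counts = Counter(ngrams(items, 1, max_n))
    return [counts[gram] for gram in vocabulary]


docs = [
    ["refugees", "arrive", "in", "greece"],
    ["refugees", "arrive", "today"],
    ["refugees", "leave"],
]
vocabulary = build_vocabulary(docs, n_min=1, n_max=2, min_freq=2)
vector = vectorize(["refugees", "arrive", "refugees"], vocabulary)
print(dict(zip(vocabulary, vector)))
```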
Classifiers:
• Minimum Distance Classifier
• k-Nearest Neighbors
• Naive Bayes
• Support Vector Machine
• Decision Tree
• Random Forest Classifier
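Of the classifiers listed, the Minimum Distance Classifier is the simplest: it stores one centroid per class and assigns each sample to the class with the nearest centroid. The sketch below assumes Euclidean distance, which the slides do not specify, and the class name and the toy data are illustrative only.

```python
class MinimumDistanceClassifier:
    """Nearest-centroid classifier (assumption: squared Euclidean distance)."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for row, label in zip(X, y):
            if label not in sums:
                sums[label] = [0.0] * len(row)
                counts[label] = 0
            sums[label] = [s + v for s, v in zip(sums[label], row)]
            counts[label] += 1
        # One centroid per class: the mean of its training vectors.
        self.centroids_ = {
            label: [s / counts[label] for s in total]
            for label, total in sums.items()
        }
        return self

    @staticmethod
    def _squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def predict(self, X):
        return [
            min(
                self.centroids_,
                key=lambda label: self._squared_distance(row, self.centroids_[label]),
            )
            for row in X
        ]


X_train = [[0, 0], [0, 1], [5, 5], [6, 5]]
y_train = ["irrelevant", "irrelevant", "relevant", "relevant"]
model = MinimumDistanceClassifier().fit(X_train, y_train)
predictions = model.predict([[1, 0], [5, 6]])
print(predictions)  # ['irrelevant', 'relevant']
```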
Performance Metrics:
• Accuracy
• Precision
• Recall
• F1
• Area Under the Curve (AUC)
• Average Precision (AP)
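The first four metrics can be computed directly from the confusion counts of a binary task. The sketch below takes "relevant" as the positive class, which is an assumption about the setup; AUC and AP additionally need ranked classifier scores and are not covered here. The function name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred, positive="relevant"):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


R, I = "relevant", "irrelevant"
metrics = binary_metrics([R, R, R, I, I, R], [R, I, R, R, I, R])
print(metrics)
```

In this toy example there are 3 true positives, 1 false positive, 1 false negative and 1 true negative, so precision, recall and F1 are all 0.75 and accuracy is 4/6.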
Results
Classifier: Minimum Distance Classifier
Feature Set                          Accuracy     Precision    Recall       F1           AP           AUC
Full feature set                     0.57 ± 0.05  0.72 ± 0.14  0.41 ± 0.14  0.50 ± 0.11  0.73 ± 0.07  0.59 ± 0.05
Part-of-speech                       0.57 ± 0.06  0.71 ± 0.15  0.42 ± 0.14  0.50 ± 0.11  0.72 ± 0.07  0.58 ± 0.06
Chunks                               0.57 ± 0.05  0.71 ± 0.14  0.42 ± 0.13  0.51 ± 0.10  0.72 ± 0.07  0.59 ± 0.05
Named entities                       0.57 ± 0.05  0.71 ± 0.14  0.42 ± 0.13  0.51 ± 0.10  0.72 ± 0.07  0.58 ± 0.05
Chars+Tokens+Allcaps+Allcaps-ratio   0.57 ± 0.06  0.71 ± 0.14  0.41 ± 0.14  0.50 ± 0.11  0.73 ± 0.07  0.59 ± 0.05
Positive+Neutral+Negative            0.56 ± 0.05  0.70 ± 0.15  0.41 ± 0.13  0.50 ± 0.11  0.72 ± 0.07  0.58 ± 0.05
LDA topic distribution               0.63 ± 0.15  0.65 ± 0.17  0.89 ± 0.17  0.73 ± 0.12  0.80 ± 0.09  0.60 ± 0.17
Token n-grams                        0.57 ± 0.06  0.70 ± 0.16  0.46 ± 0.15  0.53 ± 0.11  0.73 ± 0.08  0.58 ± 0.06
Lemma n-grams                        0.58 ± 0.05  0.71 ± 0.14  0.48 ± 0.15  0.55 ± 0.10  0.74 ± 0.07  0.60 ± 0.05
Stem n-grams                         0.59 ± 0.05  0.72 ± 0.13  0.48 ± 0.15  0.55 ± 0.10  0.74 ± 0.06  0.60 ± 0.05
PoS n-grams                          0.58 ± 0.05  0.71 ± 0.11  0.46 ± 0.16  0.53 ± 0.12  0.73 ± 0.05  0.59 ± 0.05
Chunk n-grams                        0.57 ± 0.06  0.70 ± 0.15  0.43 ± 0.14  0.51 ± 0.11  0.72 ± 0.07  0.59 ± 0.05
Table: Baseline Results for the Minimum Distance Classifier
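Results of the form "mean ± deviation" are typically obtained with k-fold cross-validation, which is listed among the evaluation methods of this work. The sketch below assumes that the ± values are standard deviations across folds, which the slides do not state explicitly, and uses contiguous, unshuffled folds for simplicity. The names `k_fold_indices` and `summarize` are hypothetical.

```python
import statistics


def k_fold_indices(n_samples, k):
    """Return k (train_indices, test_indices) pairs covering all samples."""
    # The first n_samples % k folds get one extra sample.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds


def summarize(scores):
    """Mean and (population) standard deviation of per-fold scores."""
    return statistics.mean(scores), statistics.pstdev(scores)


folds = k_fold_indices(10, 3)
mean, std = summarize([0.5, 0.7])
print(f"{mean:.2f} ± {std:.2f}")
```

In practice each fold would train a classifier on the train indices, score it on the test indices, and pass the k scores to `summarize`.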
Best ROC Curves
Figure: ROC Curves of a SVM Classifier using PoS tags as features
Best PR Curves
Figure: Precision-Recall Curves of a Minimum Distance Classifier using
LDA topic distributions as features
Feature Engineering
• Number of used features: 201
Preprocessing methods:
• Standardization / Normalization / Scaling
Feature Selection/Reduction methods:
• Information Gain/Gain Ratio
• Chi-square (χ2) / Fisher score / Pearson Correlation
• PCA (4 dimensions)
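Two of the steps above, standardization and the Pearson correlation filter, can be sketched in a few lines. The example assumes that the filter ranks features by the absolute value of their correlation with a numeric (0/1) label and keeps the top k; the slides do not say whether a top-k cut or a threshold was used. The function names are hypothetical.

```python
import math


def standardize(column):
    """Rescale one feature column to zero mean and unit variance."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    # A constant column carries no information; map it to zeros.
    return [(v - mean) / std if std > 0 else 0.0 for v in column]


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def select_by_pearson(X, y, k):
    """Indices of the k features most correlated (in absolute value) with y."""
    columns = list(zip(*X))
    scores = [abs(pearson(col, y)) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


X = [[1, 5, 0], [2, 3, 0], [3, 6, 0], [4, 4, 0]]
y = [0, 0, 1, 1]
selected = select_by_pearson(X, y, k=2)
z = standardize([row[0] for row in X])
print(selected, [round(v, 3) for v in z])
```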
Results
Classifier: Support Vector Machine (SVM)
Pipeline applied                       Accuracy     Precision    Recall       F1           AP           AUC
Standardization + full feature set     0.58 ± 0.12  0.65 ± 0.14  0.57 ± 0.21  0.58 ± 0.17  0.73 ± 0.10  0.58 ± 0.12
Normalization + full feature set       0.58 ± 0.09  0.66 ± 0.13  0.58 ± 0.16  0.60 ± 0.10  0.74 ± 0.07  0.59 ± 0.10
Scaling[0,1] + full feature set        0.59 ± 0.11  0.65 ± 0.15  0.61 ± 0.19  0.61 ± 0.14  0.74 ± 0.09  0.59 ± 0.11
Standardization + information gain     0.54 ± 0.09  0.62 ± 0.14  0.53 ± 0.37  0.47 ± 0.28  0.71 ± 0.10  0.54 ± 0.07
Standardization + gain ratio           0.54 ± 0.06  0.57 ± 0.06  0.66 ± 0.09  0.61 ± 0.05  0.71 ± 0.04  0.52 ± 0.06
Standardization + chi-square (χ²)      0.56 ± 0.11  0.61 ± 0.20  0.58 ± 0.34  0.54 ± 0.24  0.71 ± 0.14  0.56 ± 0.11
Standardization + fisher score         0.54 ± 0.09  0.57 ± 0.09  0.54 ± 0.29  0.53 ± 0.18  0.68 ± 0.10  0.54 ± 0.08
Standardization + pearson              0.65 ± 0.16  0.66 ± 0.17  0.94 ± 0.11  0.76 ± 0.10  0.81 ± 0.08  0.61 ± 0.18
Standardization + pearson + pca4d      0.64 ± 0.16  0.65 ± 0.17  0.94 ± 0.11  0.75 ± 0.10  0.81 ± 0.09  0.61 ± 0.18
Standardization + gain ratio + pca4d   0.59 ± 0.05  0.58 ± 0.04  0.96 ± 0.10  0.72 ± 0.04  0.78 ± 0.03  0.54 ± 0.06
Table: Results of applying different Preprocessing and Feature Selection
methods with a Support Vector Machine
Best ROC Curves
Figure: ROC Curves of a kNN Classifier using Standardization and the
Pearson Correlation Filter
Best PR Curves
Figure: Precision-Recall Curves of a Naive Bayes Classifier using
Standardization and the Pearson Correlation Filter
Predicting Relevance through Journalistic Criteria
Overview
Figure: Prediction of Relevance using Journalistic Criteria
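The two-stage idea in the figure can be sketched as follows: one intermediate classifier per journalistic criterion is trained on the linguistic features, and a final classifier predicts relevance from the intermediate predictions. This is only an illustration under stated assumptions. A tiny 1-nearest-neighbour classifier stands in for both stages (the reported best setup uses Random Forests for the intermediate classifiers and kNN for the final one), the criteria are encoded as hypothetical 0/1 labels, and the final stage is trained on in-sample intermediate predictions, which a careful evaluation would replace with held-out predictions.

```python
class OneNearestNeighbour:
    """Stand-in classifier: returns the label of the closest training row."""

    def fit(self, X, y):
        self.X_, self.y_ = [list(row) for row in X], list(y)
        return self

    def predict(self, X):
        predictions = []
        for row in X:
            dists = [sum((a - b) ** 2 for a, b in zip(row, ref)) for ref in self.X_]
            predictions.append(self.y_[dists.index(min(dists))])
        return predictions


class CriteriaEnsemble:
    """Relevance predicted from per-criterion intermediate classifiers."""

    def __init__(self, criteria, make_classifier):
        self.criteria = list(criteria)
        self.make_classifier = make_classifier

    def fit(self, X, criteria_labels, relevance):
        # Stage 1: one classifier per journalistic criterion.
        self.intermediate_ = {
            c: self.make_classifier().fit(X, criteria_labels[c]) for c in self.criteria
        }
        # Stage 2: final classifier on the vector of criterion predictions.
        self.final_ = self.make_classifier().fit(self._criteria_vector(X), relevance)
        return self

    def _criteria_vector(self, X):
        per_criterion = [self.intermediate_[c].predict(X) for c in self.criteria]
        return [list(row) for row in zip(*per_criterion)]

    def predict(self, X):
        return self.final_.predict(self._criteria_vector(X))


# Toy data: two linguistic features, two criteria, binary relevance.
X_train = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]]
labels = {"informativeness": [0, 0, 1, 1], "novelty": [0, 0, 1, 1]}
relevance = [0, 0, 1, 1]

ensemble = CriteriaEnsemble(["informativeness", "novelty"], OneNearestNeighbour)
ensemble.fit(X_train, labels, relevance)
predicted = ensemble.predict([[0.05, 0.1], [1.0, 0.9]])
print(predicted)  # [0, 1]
```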
Results
Relevance based on Journalistic Criteria
Intermediate Classifiers       Accuracy     Precision    Recall       F1           AP           AUC
Minimum Distance Classifiers   0.62 ± 0.11  0.66 ± 0.17  0.89 ± 0.19  0.72 ± 0.08  0.80 ± 0.06  0.59 ± 0.13
K-Nearest Neighbors            0.54 ± 0.08  0.63 ± 0.14  0.57 ± 0.17  0.57 ± 0.08  0.72 ± 0.05  0.54 ± 0.09
Naive Bayes                    0.56 ± 0.01  0.56 ± 0.01  0.97 ± 0.03  0.71 ± 0.01  0.77 ± 0.01  0.52 ± 0.01
Linear SVMs                    0.54 ± 0.03  0.56 ± 0.02  0.89 ± 0.08  0.68 ± 0.03  0.75 ± 0.02  0.50 ± 0.04
Decision Trees                 0.55 ± 0.05  0.57 ± 0.03  0.78 ± 0.11  0.65 ± 0.05  0.73 ± 0.03  0.52 ± 0.05
Random Forests                 0.79 ± 0.07  0.80 ± 0.08  0.84 ± 0.07  0.82 ± 0.06  0.86 ± 0.04  0.78 ± 0.08
Table: Results on Predicting Relevance by an Ensemble of Journalistic
Classifiers
Best ROC Curves
Figure: ROC Curves of a Journalistic Based kNN Classifier, using
Random Forests for the intermediate classifiers
Best PR Curves
Figure: Precision-Recall Curves of a Journalistic Based kNN Classifier,
using Random Forests for the intermediate classifiers
Conclusions
Final Remarks
• Under the scope of the REMINDS project, a classifier was
created using exclusively linguistic features.
• Feature engineering leads to slightly better results, but the
baseline experiments are still competitive
• Future integration with other filters which consider other
features.
• The best approach uses an ensemble of classifiers, one targeted at
each journalistic criterion, with linguistic features extracted
from the text. It achieves an F1 score of 0.82 and an AUC
of 0.78.
• Results are in line with state-of-the-art results that follow
similar approaches but classify documents according to
other criteria.
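As a quick consistency check, F1 is the harmonic mean of precision and recall. The header row of the results table is not reproduced here, so the sketch below assumes that the second and third values of its Random Forests row (0.80 and 0.84) are precision and recall. Under that assumption, the computed value matches the reported F1.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Assumed precision and recall, read from the Random Forests row
# of the results table (column meaning is an assumption).
f1 = f1_score(0.80, 0.84)
print(round(f1, 2))
```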
References
A. Pinto, H. Gonçalo Oliveira, and A. Oliveira Alves.
Comparing the Performance of Different NLP Toolkits in
Formal and Social Media Text.
In Marjan Mernik, José Paulo Leal, and Hugo Gonçalo Oliveira,
editors, 5th Symposium on Languages, Applications and
Technologies (SLATE'16), volume 51 of OpenAccess Series in
Informatics (OASIcs), pages 1–16, Dagstuhl, Germany, 2016.
Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
ISBN 978-3-95977-006-4.
doi: http://dx.doi.org/10.4230/OASIcs.SLATE.2016.3.
URL: http://drops.dagstuhl.de/opus/volltexte/2016/6008.
Relevance Mining and Detection System

  • 1. Classification of Social Media Posts according to their Relevance Author: Alexandre Pinto Advisors: Prof. Dr. Hugo Gon¸calo Oliveira Prof. Dr. Ana Oliveira Alves Faculty of Sciences and Technology Department of Informatics Engineering University of Coimbra September 9, 2016
  • 2. Summary Contents 1 Introduction 2 Objectives 3 Benchmarking NLP Toolkits 4 Relevance Detection 5 Conclusions Classification of Social Media Posts, according to their Relevance Alexandre Pinto 2/60
  • 4. Introduction REMINDS REMINDS = RElevance MINing and Detection System Classification of Social Media Posts, according to their Relevance Alexandre Pinto 4/60
  • 5. Introduction REMINDS REMINDS = RElevance MINing and Detection System Main Goal: • Development of a system capable of detecting relevant information, according to journalistic criteria, published in social networks while ignoring irrelevant information such as private comments and personal information, or public text that is not important. Classification of Social Media Posts, according to their Relevance Alexandre Pinto 5/60
  • 6. Introduction REMINDS REMINDS = RElevance MINing and Detection System Four Main Approaches/Four Different Teams: • Text Mining • Sentiment Analysis • Interaction Patterns and Network Topologies • Natural Language Processing (NLP) Classification of Social Media Posts, according to their Relevance Alexandre Pinto 6/60
  • 7. Introduction What is Relevance ? Definition: • “The degree to which something is related or useful to what is happening or being talked about” Human notion: • Hard to measure and define Ambiguos nature: • Cannot simply search for it (Information Retrieval) • Must instead filter out irrelevant content (Classification) Classification of Social Media Posts, according to their Relevance Alexandre Pinto 7/60
  • 8. Introduction What is Relevance ? Definition: • “The degree to which something is related or useful to what is happening or being talked about” Human notion: • Hard to measure and define Ambiguos nature: • Cannot simply search for it (Information Retrieval) • Must instead filter out irrelevant content (Classification) Classification of Social Media Posts, according to their Relevance Alexandre Pinto 7/60
  • 9. Introduction What is Relevance ? Definition: • “The degree to which something is related or useful to what is happening or being talked about” Human notion: • Hard to measure and define Ambiguos nature: • Cannot simply search for it (Information Retrieval) • Must instead filter out irrelevant content (Classification) Classification of Social Media Posts, according to their Relevance Alexandre Pinto 7/60
  • 11. Objectives Goal • Automatic classification of public social data according to their potential relevance to a general audience, filtering out irrelevant information. • Rely primarily on linguistic features, extracted with the help of existing NLP tools • Confirm if relevance can be predicted from a set of journalistic criteria. Classification of Social Media Posts, according to their Relevance Alexandre Pinto 9/60
  • 12. Objectives Goal • Automatic classification of public social data according to their potential relevance to a general audience, filtering out irrelevant information. • Rely primarily on linguistic features, extracted with the help of existing NLP tools • Confirm if relevance can be predicted from a set of journalistic criteria. Classification of Social Media Posts, according to their Relevance Alexandre Pinto 9/60
  • 13. Objectives Goal • Automatic classification of public social data according to their potential relevance to a general audience, filtering out irrelevant information. • Rely primarily on linguistic features, extracted with the help of existing NLP tools • Confirm if relevance can be predicted from a set of journalistic criteria. Classification of Social Media Posts, according to their Relevance Alexandre Pinto 9/60
  • 14. Objectives The Big Picture Figure: System Overview Classification of Social Media Posts, according to their Relevance Alexandre Pinto 10/60
  • 15. Objectives The Big Picture Figure: System Overview (Focus of this work) Classification of Social Media Posts, according to their Relevance Alexandre Pinto 11/60
  • 17. Why Benchmarking NLP Toolkits [Pinto et al.(2016)Pinto, Oliveira, and Alves] For widely-spoken languages, such as English: • Wide range of NLP toolkits • Complex applications do not have to be developed from scratch • Difficult choice among the available tools Classification of Social Media Posts, according to their Relevance Alexandre Pinto 13/60
  • 18. Why Benchmarking NLP Toolkits User Choices Aspects to consider: • Community of users • Frequency of new versions and updates • Cost of integration • Programming language • Covered tasks • Performance (with formal and social media text) Classification of Social Media Posts, according to their Relevance Alexandre Pinto 14/60
  • 19. Why Benchmarking NLP Toolkits User Choices Aspects to consider: • Community of users • Frequency of new versions and updates • Cost of integration • Programming language • Covered tasks • Performance (with formal and social media text) Classification of Social Media Posts, according to their Relevance Alexandre Pinto 14/60
  • 20. Why Benchmarking NLP Toolkits User Choices Aspects to consider: • Community of users • Frequency of new versions and updates • Cost of integration • Programming language • Covered tasks • Performance (with formal and social media text) Classification of Social Media Posts, according to their Relevance Alexandre Pinto 14/60
  • 21. Why Benchmarking NLP Toolkits User Choices Aspects to consider: • Community of users • Frequency of new versions and updates • Cost of integration • Programming language • Covered tasks • Performance (with formal and social media text) Classification of Social Media Posts, according to their Relevance Alexandre Pinto 14/60
  • 22. Why Benchmarking NLP Toolkits User Choices Aspects to consider: • Community of users • Frequency of new versions and updates • Cost of integration • Programming language • Covered tasks • Performance (with formal and social media text) Classification of Social Media Posts, according to their Relevance Alexandre Pinto 14/60
  • 23. Why Benchmarking NLP Toolkits User Choices Aspects to consider: • Community of users • Frequency of new versions and updates • Cost of integration • Programming language • Covered tasks • Performance (with formal and social media text) Classification of Social Media Posts, according to their Relevance Alexandre Pinto 14/60
  • 24. Why Benchmarking NLP Toolkits User Choices Aspects to consider: • Community of users • Frequency of new versions and updates • Cost of integration • Programming language • Covered tasks • Performance (with formal and social media text) Classification of Social Media Posts, according to their Relevance Alexandre Pinto 14/60
  • 25. Benchmarking NLP Toolkits Workplan Methodology: • Choose a range of NLP toolkits • Use of default configurations (pre-trained models) • Perform a set of standard tasks • Use of popular datasets that cover newspaper and social network text • Analyse results Classification of Social Media Posts, according to their Relevance Alexandre Pinto 15/60
  • 26. Benchmarking NLP Toolkits Workplan Methodology: • Choose a range of NLP toolkits • Use of default configurations (pre-trained models) • Perform a set of standard tasks • Use of popular datasets that cover newspaper and social network text • Analyse results Classification of Social Media Posts, according to their Relevance Alexandre Pinto 15/60
  • 27. Benchmarking NLP Toolkits Workplan Methodology: • Choose a range of NLP toolkits • Use of default configurations (pre-trained models) • Perform a set of standard tasks • Use of popular datasets that cover newspaper and social network text • Analyse results Classification of Social Media Posts, according to their Relevance Alexandre Pinto 15/60
  • 28. Benchmarking NLP Toolkits Workplan Methodology: • Choose a range of NLP toolkits • Use of default configurations (pre-trained models) • Perform a set of standard tasks • Use of popular datasets that cover newspaper and social network text • Analyse results Classification of Social Media Posts, according to their Relevance Alexandre Pinto 15/60
  • 29. Benchmarking NLP Toolkits Workplan Methodology: • Choose a range of NLP toolkits • Use of default configurations (pre-trained models) • Perform a set of standard tasks • Use of popular datasets that cover newspaper and social network text • Analyse results Classification of Social Media Posts, according to their Relevance Alexandre Pinto 15/60
  • 31. Addressed Tasks Lower-level NLP Tasks Figure: Tokenization1 1 www.nltk.org/book/ch07.html Classification of Social Media Posts, according to their Relevance Alexandre Pinto 17/60
  • 32. Addressed Tasks Lower-level NLP Tasks Figure: Tokenization1 Figure: Part-of-Speech (POS) Tagging1 1 www.nltk.org/book/ch07.html Classification of Social Media Posts, according to their Relevance Alexandre Pinto 17/60
  • 33. Addressed Tasks Lower-level NLP Tasks Figure: Chunking1 1 www.nltk.org/book/ch07.html Classification of Social Media Posts, according to their Relevance Alexandre Pinto 17/60
  • 34. Addressed Tasks Lower-level NLP Tasks Figure: Chunking1 Figure: Name Entity Recognition/Classification2 1 www.nltk.org/book/ch07.html 2 stanfordnlp.github.io/CoreNLP Classification of Social Media Posts, according to their Relevance Alexandre Pinto 17/60
  • 36. Used Datasets • Public datasets to assess the performance of NLP tools and thus making decisions • Well-known and widely used in text classification research, such as training and evaluating new tools • Different gold standard datasets that cover different kinds of text – newspaper and social media Classification of Social Media Posts, according to their Relevance Alexandre Pinto 19/60
  • 37. Used Datasets • Public datasets to assess the performance of NLP tools and thus making decisions • Well-known and widely used in text classification research, such as training and evaluating new tools • Different gold standard datasets that cover different kinds of text – newspaper and social media Classification of Social Media Posts, according to their Relevance Alexandre Pinto 19/60
  • 38. Used Datasets • Public datasets to assess the performance of NLP tools and thus making decisions • Well-known and widely used in text classification research, such as training and evaluating new tools • Different gold standard datasets that cover different kinds of text – newspaper and social media Classification of Social Media Posts, according to their Relevance Alexandre Pinto 19/60
  • 39. Used Datasets Newspaper and Social Media CoNLL-2003 shared task data • Collection of news wire articles from the Reuters Corpus (PoS,Chunk,NER) Alan Ritter Twitter dataset • Collection of randomly sampled tweets (PoS,Chunk,NER) MSM 2013 workshop • Collection of randomly sampled tweets (NER) Format Token POS Syntactic Chunk Named Entity Only RB B-NP O France NNP I-NP LOC and CC I-NP O Britain NNP I-NP LOC backed VBD B-VP O Fischler NNP B-NP PER ’s POS B-NP O proposal NN I-NP O . . O O Table: Example of the Annotated Data Format Classification of Social Media Posts, according to their Relevance Alexandre Pinto 20/60
  • 43. Used Datasets: Newspaper and Social Media
      PoS:
      • Penn Treebank (PTB) style (CoNLL-2003)
      • PTB + Twitter-specific tags (@usernames, #hashtags, and URLs) (Ritter)
      Chunking Format:
      • IOB-TYPE format
      Named Entities:
      • PER, LOC, ORG or MISC
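The four-column annotation layout described on these slides (token, PoS tag, chunk tag, named-entity tag) can be read with a few lines of Python. This is a minimal sketch that assumes whitespace-separated columns and blank lines between sentences, as in the example table; the `Token` type and `parse_conll` function are hypothetical names, and the actual dataset distributions may differ in details such as document separators.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    text: str
    pos: str      # Penn Treebank style PoS tag
    chunk: str    # IOB-TYPE chunk tag, e.g. B-NP, I-NP, O
    entity: str   # named-entity tag, "O" when the token is not an entity


def parse_conll(block: str) -> List[List[Token]]:
    """Parse 'token pos chunk entity' lines; a blank line ends a sentence."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in block.splitlines():
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        text, pos, chunk, entity = line.split()
        current.append(Token(text, pos, chunk, entity))
    if current:
        sentences.append(current)
    return sentences


# The example sentence from the slide, one token per line.
SAMPLE = """\
Only RB B-NP O
France NNP I-NP LOC
and CC I-NP O
Britain NNP I-NP LOC
backed VBD B-VP O
Fischler NNP B-NP PER
's POS B-NP O
proposal NN I-NP O
. . O O
"""

sentences = parse_conll(SAMPLE)
entities = [(t.text, t.entity) for t in sentences[0] if t.entity != "O"]
print(entities)
```

Keeping each sentence as a list of `Token` tuples makes it straightforward to compare a toolkit's predicted tags against the gold tags column by column.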
  • 46. Used Datasets: Statistics
        Dataset                    Documents    Tokens    Average Tokens per Document
        CoNLL (Reuters Corpus)     946          203621    215
        Twitter (Alan Ritter)      2394         46469     19
        #MSM2013                   2815         52124     19
      Table: Dataset properties
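As a quick check on the table, the averages follow from dividing tokens by documents and rounding to the nearest integer. The snippet below is only an arithmetic sketch using the figures reported on the slide; the dictionary layout is an illustrative choice.

```python
# (documents, tokens) per dataset, as reported in the dataset properties table.
datasets = {
    "CoNLL (Reuters Corpus)": (946, 203621),
    "Twitter (Alan Ritter)": (2394, 46469),
    "#MSM2013": (2815, 52124),
}

# Average tokens per document, rounded to the nearest integer.
averages = {
    name: round(tokens / documents)
    for name, (documents, tokens) in datasets.items()
}

for name, avg in averages.items():
    print(f"{name}: {avg} tokens per document")
```

The news articles are roughly ten times longer than the tweets, which is one reason the two kinds of text are benchmarked separately.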
  • 48. Compared Tools: Standard vs Social NLP toolkits
      Standard NLP toolkits:
      • NLTK
      • Apache OpenNLP
      • Stanford CoreNLP
      • Pattern
      Social Network-Oriented Toolkits:
      • TwitterNLP
      • TweetNLP
      • TwitIE
  • 50. Compared Tools: Tools Summary
        System        Programming Language    Target Text
        NLTK          Python                  Generic
        OpenNLP       Java                    Generic
        CoreNLP       Java                    Generic
        Pattern       Python                  Generic
        TweetNLP      Java                    Social Media
        TwitterNLP    Python                  Social Media
        TwitIE        Java                    Social Media
      (Per-task support columns: Tokenization, PoS tagging, Chunking, NER; the cell marks are not preserved in this transcript.)
      Table: Toolkit properties
  • 52. Comparison Results
      All values are F1 ± σ.
                      CoNLL                                          Alan Ritter - Twitter
        Tool          PoS            Chunking       NEC              PoS            Chunking       NEC
        OpenNLP       0.88 ± 0.10    0.83 ± 0.12    0.87 ± 0.09      0.71 ± 0.17    0.45 ± 0.39    0.87 ± 0.13
        TweetNLP      0.84 ± 0.09    n/a            n/a              0.95 ± 0.07    n/a            n/a
        TwitterNLP    0.83 ± 0.15    0.83 ± 0.13    0.85 ± 0.12      0.92 ± 0.11    0.90 ± 0.11    0.95 ± 0.08
      Table: Best Performance Results
  • 53. Comparison Results: Discussion
      • Common NLP tools usually perform well on well-formed content, such as news
      • Noisy and informal text, such as tweets, brings new challenges and decreases performance
      • Specially tailored tools such as CMU TweetNLP and TwitterNLP perform well on social media text and were used in the feature extraction process
      • General-purpose tools offer better support and are more customizable (they accept newly trained models)
  • 58. Relevance Detection: Methods used in this work
      Definition of Relevance
      • Criteria³: Controversialness, Informativeness, Meaningfulness, Novelty, Reliability, Scope
      NLP Tasks
      • Extraction: Part-of-Speech, Chunking, Named Entities, Polarity of words, LDA topics, N-gram, Stemming, Lemmatization
      Machine Learning Methods
      • Preprocessing: Standardization, Normalization, Scaling
      • Selection: Info. Gain, Gain Ratio, Fisher, Pearson, Chi-square
      • Reduction: PCA
      • Models: MDC, kNN, NB, SVM, DT, RF
      • Evaluation: Accuracy, Precision, Recall, F1, ROC, AP, k-Fold-CV
      Table: Methods used in this work
      ³ Journalistic criteria established by CRACS@INESC-TEC
  • 59. Related Work
        Author(s)          Target Class           Source Data    Classifier    Performance
        Mohammad et al.    Sentiment              Twitter        SVM           F1=0.69
        Sriram             News’s type            Twitter        SVM           Acc=0.96
        Fernandes et al.   Popularity             Mashable       RF            F1=0.69
        Guerini et al.     Buzz                   Digg           SVM           F1=0.81
        Zeng et al.        Helpful Opinion        Amazon         SVM           Acc=0.72
        Irani et al.       Trending Content       Twitter        C4.5          F1=0.79
        Lee et al.         Trending Categories    Twitter        NB            Acc=0.65
        Frain et al.       Satiric Content        Created        SVM           F1=0.89
        Liparas et al.     Topic                  News Sites     RF            F1=0.85
      Feature groups compared (per-author marks are not preserved in this transcript): word ngrams, char ngrams, all-caps, POS, #hashtags, punctuation, emoticons, elongated words, clusters, authorship info., digital media, #words, #links, length, LDA topics, polarity, lemmas, TF-IDF, #profanity
      Table: Related Work
  • 61. Relevance Detection: Used Datasets
      • Textual messages gathered (by CRACS@INESC-TEC) from Twitter and Facebook
      • Text quality preferred over text quantity
  • 63. Relevance Detection: Used Datasets
      Twitter search queries:
      • “refugees” and “Syria”
      • “elections” and “US”
      • “Olympic Games”
      • “terrorism”
      • “Daesh”
      Official Facebook pages:
      • Euronews, CNN, Washington Post, Financial Times, New York Post, The New York Times, BBC News, The Telegraph, The Guardian, The Huffington Post, Der Spiegel International, Deutsche Welle News, Pravda and Fox News
  • 65. Relevance Detection: Used Datasets
      • The same method was used with other journalistic criteria, such as interestingness, controversy, meaningfulness, novelty, reliability and scope
  • 66. Relevance Detection: Used Datasets
                                        #Facebook Posts          #Facebook Comments       #Tweets
        Search Word                     Relevant   Irrelevant    Relevant   Irrelevant    Relevant   Irrelevant
        “Refugees” + “Syria”            20         4             30         13            55         23
        “Elections” + “US”              21         8             21         14            29         39
        “Olympic Games”                 2          0             4          1             22         114
        “Terrorism”                     53         16            138        88            59         53
        “Daesh”                         2          0             14         12            26         30
        “Referendum” + “UK” + “EU”      4          0             7          1             14         4
      Table: Documents grouped by source, relevance label and query
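Summing the table gives the overall class balance, which is a useful reference point when reading the precision and recall figures reported later. The sketch below only adds up the counts shown on the slide; the nested-dictionary layout is an illustrative choice, and it assumes the table lists the complete dataset.

```python
# (relevant, irrelevant) counts per query for Facebook posts, Facebook comments and tweets.
counts = {
    "Refugees + Syria": [(20, 4), (30, 13), (55, 23)],
    "Elections + US": [(21, 8), (21, 14), (29, 39)],
    "Olympic Games": [(2, 0), (4, 1), (22, 114)],
    "Terrorism": [(53, 16), (138, 88), (59, 53)],
    "Daesh": [(2, 0), (14, 12), (26, 30)],
    "Referendum + UK + EU": [(4, 0), (7, 1), (14, 4)],
}

relevant = sum(rel for sources in counts.values() for rel, _ in sources)
irrelevant = sum(irr for sources in counts.values() for _, irr in sources)
total = relevant + irrelevant

print(f"relevant={relevant} irrelevant={irrelevant} total={total}")
print(f"share of relevant documents: {relevant / total:.1%}")
```

Under that assumption, a little over half of the documents are labelled relevant, so a classifier that marks everything as relevant would already reach a precision of about 0.55 with a recall of 1.0.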
  • 67. Relevance Detection: Used Datasets
      Columns: Content | Source | Answers (A1, A2, A3) | Class
      • “Putin: Turkey supports terrorism and stabs Russia in the back” | FB post | 5, 4, 5 | Relevant
      • “Canada to accept additional 10,000 Syrian refugees” | Tweet | 4, 5, 5 | Relevant
      • “Lololol winning the internet and stomping out daesh #merica” | Tweet | 1, 1, 1 | Irrelevant
      • “Comparing numbers of people killed by terrorism with numbers killed by slipping in bath tub is stupid as eff. It totally ignores the mal-intent behind terrorism, its impact on way of life and ideology.” | FB comment | 2, 4, 3 | Irrelevant
      Table: Examples of messages in the dataset
  • 69. Feature Extraction: Feature Set
        Feature Set                                          #Distinct Features
        PoS-tags                                             54
        Chunk tags                                           23
        NE tags                                              11
        Total number of PoS/Chunk tags                       2
        Total number of Named Entities                       1
        Total number of positive/neutral/negative words      3
        Total number of characters/tokens                    2
        Total number/proportion of all capitalized words     2
        LDA topic distribution                               20
        Token 1-3grams                                       2711 (f ≥ 3)
        Lemma 1-5grams                                       top-750 (f ≥ 1)
        Stem 1-5grams                                        top-750 (f ≥ 1)
        PoS 1-5grams                                         top-125 (f ≥ 1)
        Chunk 1-5grams                                       top-125 (f ≥ 1)
        Total                                                4,579
      Table: Feature sets used
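The n-gram rows of the table combine a length range (for example 1 to 3), a minimum frequency f, and optionally a top-k cutoff. A minimal sketch of that kind of vocabulary construction is shown below. It assumes f is the corpus-wide occurrence count, which the slides do not specify, and the function names `ngrams`, `build_vocabulary` and `vectorize` are hypothetical.

```python
from collections import Counter
from typing import Iterator, List, Optional, Sequence, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: Sequence[str], n_min: int, n_max: int) -> Iterator[NGram]:
    """Yield every contiguous n-gram with n_min <= n <= n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def build_vocabulary(docs: Sequence[Sequence[str]], n_min: int, n_max: int,
                     min_freq: int = 1, top_k: Optional[int] = None) -> List[NGram]:
    """Keep n-grams seen at least min_freq times, optionally only the top_k most frequent."""
    counts = Counter(g for doc in docs for g in ngrams(doc, n_min, n_max))
    kept = [g for g, c in counts.most_common() if c >= min_freq]
    return kept if top_k is None else kept[:top_k]


def vectorize(doc: Sequence[str], vocabulary: Sequence[NGram],
              n_min: int, n_max: int) -> List[int]:
    """Count how often each vocabulary n-gram occurs in one document."""
    counts = Counter(ngrams(doc, n_min, n_max))
    return [counts[g] for g in vocabulary]


docs = [
    ["syrian", "refugees", "arrive"],
    ["syrian", "refugees", "welcome"],
    ["syrian", "refugees", "arrive", "today"],
]
# Token 1-3grams with f >= 3, mirroring the thresholds named in the table.
vocab = build_vocabulary(docs, 1, 3, min_freq=3)
print(vocab)
print(vectorize(["syrian", "refugees", "flee"], vocab, 1, 3))
```

The same functions apply to lemma, stem, PoS or chunk sequences by passing those sequences instead of raw tokens and setting `top_k` to 750 or 125.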
  • 71. Baseline Experiments
      Feature sets:
      • Full feature set
      • Part-of-speech
      • Chunks
      • Named entities
      • Chars+Tokens+Allcaps+Allcaps-ratio
      • Positive+Neutral+Negative
      • LDA topic distribution
      • Token n-grams
      • Lemma n-grams
      • Stem n-grams
      • PoS n-grams
      • Chunk n-grams
  • 72. Baseline Experiments
      Classifiers:
      • Minimum Distance Classifier
      • k-Nearest Neighbors
      • Naive Bayes
      • Support Vector Machine
      • Decision Tree
      • Random Forest Classifier
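Of the classifiers listed, the Minimum Distance Classifier is the simplest to show in full: it stores one centroid per class and assigns each new vector to the nearest one. The sketch below assumes Euclidean distance, which the slides do not state; the class name and toy data are illustrative only.

```python
import math
from typing import Dict, Hashable, List, Sequence


class MinimumDistanceClassifier:
    """Nearest-centroid classifier using Euclidean distance (an assumption)."""

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[Hashable]):
        groups: Dict[Hashable, List[Sequence[float]]] = {}
        for row, label in zip(X, y):
            groups.setdefault(label, []).append(row)
        # One centroid per class: the per-feature mean of that class's rows.
        self.centroids_ = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()
        }
        return self

    def predict(self, X: Sequence[Sequence[float]]) -> List[Hashable]:
        return [
            min(self.centroids_, key=lambda label: math.dist(row, self.centroids_[label]))
            for row in X
        ]


# Toy two-feature data; real inputs would be the linguistic feature vectors.
X_train = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
y_train = ["irrelevant", "irrelevant", "relevant", "relevant"]

mdc = MinimumDistanceClassifier().fit(X_train, y_train)
predictions = mdc.predict([[1.0, 1.0], [5.0, 4.0]])
print(predictions)
```

Because distances depend on feature scale, a classifier of this kind is sensitive to the preprocessing step (standardization, normalization or scaling).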
  • 73. Baseline Experiments
      Performance Metrics:
      • Accuracy
      • Precision
      • Recall
      • F1
      • Area Under the Curve (AUC)
      • Average Precision (AP)
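The threshold-based metrics in this list can be computed directly from true and predicted labels. The following is a minimal sketch with hypothetical function names, treating "relevant" as the positive class (encoded as 1); AUC and AP need ranked scores rather than hard labels and are left out.

```python
from typing import Sequence, Tuple


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true: Sequence[int], y_pred: Sequence[int],
                        positive: int = 1) -> Tuple[float, float, float]:
    """Precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy labels: 1 = relevant, 0 = irrelevant.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]

acc = accuracy(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"accuracy={acc:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

F1 is the harmonic mean of precision and recall, so it drops sharply when either of the two is low.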
  • 74. Baseline Experiments: Results
      Classifier: Minimum Distance Classifier
        Feature Set                           Accuracy       Precision      Recall         F1             AP             AUC
        Full feature set                      0.57 ± 0.05    0.72 ± 0.14    0.41 ± 0.14    0.50 ± 0.11    0.73 ± 0.07    0.59 ± 0.05
        Part-of-speech                        0.57 ± 0.06    0.71 ± 0.15    0.42 ± 0.14    0.50 ± 0.11    0.72 ± 0.07    0.58 ± 0.06
        Chunks                                0.57 ± 0.05    0.71 ± 0.14    0.42 ± 0.13    0.51 ± 0.10    0.72 ± 0.07    0.59 ± 0.05
        Named entities                        0.57 ± 0.05    0.71 ± 0.14    0.42 ± 0.13    0.51 ± 0.10    0.72 ± 0.07    0.58 ± 0.05
        Chars+Tokens+Allcaps+Allcaps-ratio    0.57 ± 0.06    0.71 ± 0.14    0.41 ± 0.14    0.50 ± 0.11    0.73 ± 0.07    0.59 ± 0.05
        Positive+Neutral+Negative             0.56 ± 0.05    0.70 ± 0.15    0.41 ± 0.13    0.50 ± 0.11    0.72 ± 0.07    0.58 ± 0.05
        LDA topic distribution                0.63 ± 0.15    0.65 ± 0.17    0.89 ± 0.17    0.73 ± 0.12    0.80 ± 0.09    0.60 ± 0.17
        Token n-grams                         0.57 ± 0.06    0.70 ± 0.16    0.46 ± 0.15    0.53 ± 0.11    0.73 ± 0.08    0.58 ± 0.06
        Lemma n-grams                         0.58 ± 0.05    0.71 ± 0.14    0.48 ± 0.15    0.55 ± 0.10    0.74 ± 0.07    0.60 ± 0.05
        Stem n-grams                          0.59 ± 0.05    0.72 ± 0.13    0.48 ± 0.15    0.55 ± 0.10    0.74 ± 0.06    0.60 ± 0.05
        PoS n-grams                           0.58 ± 0.05    0.71 ± 0.11    0.46 ± 0.16    0.53 ± 0.12    0.73 ± 0.05    0.59 ± 0.05
        Chunk n-grams                         0.57 ± 0.06    0.70 ± 0.15    0.43 ± 0.14    0.51 ± 0.11    0.72 ± 0.07    0.59 ± 0.05
      Table: Baseline Results for the Minimum Distance Classifier
  • 75. Baseline Experiments: Best ROC Curves
      Figure: ROC Curves of an SVM Classifier using PoS tags as features
  • 76. Baseline Experiments: Best PR Curves
      Figure: Precision-Recall Curves of a Minimum Distance Classifier using LDA topic distributions as features
  • 78. Feature Engineering
      • Number of used features: 201
      Preprocessing methods:
      • Standardization / Normalization / Scaling
      Feature Selection/Reduction methods:
      • Information Gain / Gain Ratio
      • Chi-square (χ²) / Fisher score / Pearson Correlation
      • PCA (4 dimensions)
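One way to read the "Standardization + Pearson" pipeline is: standardize each feature column to zero mean and unit variance, score each column by the absolute Pearson correlation with the class label, and keep the highest-scoring columns. The sketch below follows that reading. It is an assumption rather than the author's confirmed procedure, and `standardize`, `pearson` and `select_by_pearson` are hypothetical names.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def standardize(column: Sequence[float]) -> List[float]:
    """Rescale a feature column to zero mean and unit variance."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma if sigma else 0.0 for v in column]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; 0.0 when either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def select_by_pearson(X: Sequence[Sequence[float]], y: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k columns most correlated with the label."""
    columns = [standardize(col) for col in zip(*X)]
    scores = [abs(pearson(col, y)) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy data: column 0 tracks the label, column 1 is constant, column 2 is uncorrelated.
X = [[1, 5, 0], [2, 5, 1], [3, 5, 0], [4, 5, 1]]
y = [0, 0, 1, 1]

selected = select_by_pearson(X, y, k=1)
print(selected)
```

The Pearson coefficient itself does not change under standardization, so the preprocessing step matters for the downstream classifier rather than for the ranking of features.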
  • 81. Feature Engineering: Results
      Classifier: Support Vector Machine (SVM)
        Pipeline applied                         Accuracy       Precision      Recall         F1             AP             AUC
        Standardization + full feature set       0.58 ± 0.12    0.65 ± 0.14    0.57 ± 0.21    0.58 ± 0.17    0.73 ± 0.10    0.58 ± 0.12
        Normalization + full feature set         0.58 ± 0.09    0.66 ± 0.13    0.58 ± 0.16    0.60 ± 0.10    0.74 ± 0.07    0.59 ± 0.10
        Scaling[0,1] + full feature set          0.59 ± 0.11    0.65 ± 0.15    0.61 ± 0.19    0.61 ± 0.14    0.74 ± 0.09    0.59 ± 0.11
        Standardization + information gain       0.54 ± 0.09    0.62 ± 0.14    0.53 ± 0.37    0.47 ± 0.28    0.71 ± 0.10    0.54 ± 0.07
        Standardization + gain ratio             0.54 ± 0.06    0.57 ± 0.06    0.66 ± 0.09    0.61 ± 0.05    0.71 ± 0.04    0.52 ± 0.06
        Standardization + chi square (χ²)        0.56 ± 0.11    0.61 ± 0.20    0.58 ± 0.34    0.54 ± 0.24    0.71 ± 0.14    0.56 ± 0.11
        Standardization + fisher score           0.54 ± 0.09    0.57 ± 0.09    0.54 ± 0.29    0.53 ± 0.18    0.68 ± 0.10    0.54 ± 0.08
        Standardization + pearson                0.65 ± 0.16    0.66 ± 0.17    0.94 ± 0.11    0.76 ± 0.10    0.81 ± 0.08    0.61 ± 0.18
        Standardization + pearson + pca4d        0.64 ± 0.16    0.65 ± 0.17    0.94 ± 0.11    0.75 ± 0.10    0.81 ± 0.09    0.61 ± 0.18
        Standardization + gain ratio + pca4d     0.59 ± 0.05    0.58 ± 0.04    0.96 ± 0.10    0.72 ± 0.04    0.78 ± 0.03    0.54 ± 0.06
      Table: Results of applying different Preprocessing and Feature Selection methods with a Support Vector Machine
  • 82. Feature Engineering: Best ROC Curves
      Figure: ROC Curves of a kNN Classifier using Standardization and the Pearson Correlation Filter
  • 83. Feature Engineering: Best PR Curves
      Figure: Precision-Recall Curves of a Naive Bayes Classifier using Standardization and the Pearson Correlation Filter
  • 85. Predicting Relevance through Journalistic Criteria: Overview
      Figure: Prediction of Relevance using Journalistic Criteria
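The overview figure is not preserved here, but the two-stage idea it names can be sketched: one intermediate classifier per journalistic criterion predicts that criterion from the linguistic features, and a final classifier predicts relevance from the vector of criterion predictions. In the sketch below a nearest-centroid model stands in for both stages purely to keep the example dependency-free (the deck reports Random Forests as the best intermediate classifiers and a kNN final classifier). The class name, the use of hard 0/1 criterion predictions as meta-features, and the toy data are all assumptions.

```python
import math
from typing import Dict, List, Sequence


def fit_centroids(X: Sequence[Sequence[float]], y: Sequence[int]) -> Dict[int, List[float]]:
    """Stand-in model: one mean vector per class label."""
    groups: Dict[int, List[Sequence[float]]] = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}


def predict_centroid(centroids: Dict[int, List[float]], row: Sequence[float]) -> int:
    """Assign the label of the nearest centroid (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(row, centroids[label]))


class TwoStageRelevance:
    """Stage 1: one model per criterion. Stage 2: relevance from criterion predictions."""

    def fit(self, X, criteria_labels: Dict[str, List[int]], relevance: List[int]):
        self.stage1_ = {name: fit_centroids(X, labels)
                        for name, labels in criteria_labels.items()}
        # Simplification: meta-features come from predictions on the training rows.
        meta = [self._criteria_vector(row) for row in X]
        self.stage2_ = fit_centroids(meta, relevance)
        return self

    def _criteria_vector(self, row: Sequence[float]) -> List[int]:
        return [predict_centroid(model, row) for model in self.stage1_.values()]

    def predict(self, X) -> List[int]:
        return [predict_centroid(self.stage2_, self._criteria_vector(row)) for row in X]


# Toy data: two linguistic features, two criteria, binary labels (1 = yes).
X_train = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
criteria = {"novelty": [0, 0, 1, 1], "reliability": [0, 0, 1, 1]}
relevance = [0, 0, 1, 1]

model = TwoStageRelevance().fit(X_train, criteria, relevance)
predicted = model.predict([[0.1, 0.1], [1.0, 1.0]])
print(predicted)
```

In practice the meta-features would be produced on held-out folds, so that the final classifier is not trained on predictions the intermediate models made for their own training data.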
  • 86. Predicting Relevance through Journalistic Criteria: Results
      Relevance based on Journalistic Criteria
        Intermediate Classifiers         Accuracy       Precision      Recall         F1             AP             AUC
        Minimum Distance Classifiers     0.62 ± 0.11    0.66 ± 0.17    0.89 ± 0.19    0.72 ± 0.08    0.80 ± 0.06    0.59 ± 0.13
        K-Nearest Neighbors              0.54 ± 0.08    0.63 ± 0.14    0.57 ± 0.17    0.57 ± 0.08    0.72 ± 0.05    0.54 ± 0.09
        Naive Bayes                      0.56 ± 0.01    0.56 ± 0.01    0.97 ± 0.03    0.71 ± 0.01    0.77 ± 0.01    0.52 ± 0.01
        Linear SVMs                      0.54 ± 0.03    0.56 ± 0.02    0.89 ± 0.08    0.68 ± 0.03    0.75 ± 0.02    0.50 ± 0.04
        Decision Trees                   0.55 ± 0.05    0.57 ± 0.03    0.78 ± 0.11    0.65 ± 0.05    0.73 ± 0.03    0.52 ± 0.05
        Random Forests                   0.79 ± 0.07    0.80 ± 0.08    0.84 ± 0.07    0.82 ± 0.06    0.86 ± 0.04    0.78 ± 0.08
      Table: Results on Predicting Relevance by an Ensemble of Journalistic Classifiers
  • 87. Predicting Relevance through Journalistic Criteria: Best ROC Curves
      Figure: ROC Curves of a Journalistic Based kNN Classifier, using Random Forests for the intermediate classifiers
  • 88. Predicting Relevance through Journalistic Criteria: Best PR Curves
      Figure: Precision-Recall Curves of a Journalistic Based kNN Classifier, using Random Forests for the intermediate classifiers
  • 90. Conclusion: Final Remarks
      • Within the scope of the REMINDS project, a classifier was created using exclusively linguistic features
      • Feature engineering leads to slightly better results, but the baseline experiments remain competitive
      • Future work: integration with other filters that consider other features
  • 93. Conclusion: Final Remarks
      • The best approach uses an ensemble of classifiers, one targeted at each journalistic criterion, with linguistic features extracted from text, achieving an F1 score of 0.82 and an AUC of 0.78
      • Results are in line with state-of-the-art results from similar approaches that classify documents according to other criteria
  • 95. Classification of Social Media Posts according to their Relevance
      Author: Alexandre Pinto
      Advisors: Prof. Dr. Hugo Gonçalo Oliveira, Prof. Dr. Ana Oliveira Alves
      Faculty of Sciences and Technology, Department of Informatics Engineering, University of Coimbra
      September 9, 2016
  • 96. References
      A. Pinto, H. Gonçalo Oliveira, and A. Oliveira Alves. Comparing the Performance of Different NLP Toolkits in Formal and Social Media Text. In Marjan Mernik, José Paulo Leal, and Hugo Gonçalo Oliveira, editors, 5th Symposium on Languages, Applications and Technologies (SLATE’16), volume 51 of OpenAccess Series in Informatics (OASIcs), pages 1–16, Dagstuhl, Germany, 2016. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik. ISBN 978-3-95977-006-4. doi: http://dx.doi.org/10.4230/OASIcs.SLATE.2016.3. URL http://drops.dagstuhl.de/opus/volltexte/2016/6008.