PhD Student: Patricia Sánchez-Holgado
Director: Carlos Arcila-Calderón
Context
Twitter as a tool for scientific communication in Spain
Relevant network: user volume, free generation of content, and real-time information.
Advantage: immediacy | Disadvantage: saturation
• It has enormous potential and is beginning to play a leading role, but it also requires efficient use.
• Twitter is the network most used by science journalists.
• Science communicators increasingly use digital technology and social networks.
• The first data on a scientific or technical scoop are now made public on Twitter.
• Opinion expressed on Twitter is directly linked to national and international scientific news.
RQ1 - Can we analyze a part of the public data available in the social
network Twitter to know attitudes, opinions and sentiments towards
the communication topics of science that are shared?
Objectives
Main Objective:
Develop and evaluate a classifier for the analysis of sentiment of messages on scientific topics,
in Spanish and in real time, on the social network Twitter using machine learning techniques.
Secondary Objectives:
1. Creation of a specific corpus of texts classified by positive or negative sentiment.
2. Development of a prototype for the analysis of sentiment of scientific messages on Twitter
in real time.
3. Test the prototype.
Expected Results
Corpus of texts of scientific topics in Spanish,
labeled with positive or negative sentiment.
Prototype "OPSCIENCE" Spanish version
Methodology
• Data Mining: finding patterns in large volumes of data. Phases: Selection, Preprocessing, Transformation, Modeling, Interpretation, Evaluation.
• Machine Learning: supervised learning establishes a correspondence between the desired inputs and outputs of the system.
• Natural Language Processing (NLP): uses algorithms and statistics to understand, learn and reproduce human language, with probabilistic models based on data.
• Sentiment Analysis: computational study of sentiments expressed through texts. Polarity: positive or negative.
The goal of supervised machine learning is to create a function that, after being trained on labeled examples, is able to predict the value (here, the sentiment) of a new input element.
OPScience classifier
It analyzes the tone of scientific tweets locally and in real time:
- It uses freely available resources such as Python (version 2.7) and the Twitter Application Programming Interface (API), both REST and Streaming.
- It is based on the NLTK and scikit-learn libraries for Python.
- It trains a supervised machine learning model with 6 classification algorithms (original Naive Bayes, multinomial Naive Bayes, multivariate Bernoulli Naive Bayes, Logistic Regression, Linear Support Vector Classification, and linear classifiers with stochastic gradient descent (SGD) training).
Development of the project
STEP 1:
Creation of a corpus of scientific texts in Spanish
which will serve to train a machine learning model.
STEP 2:
Supervised machine learning model
trained with 6 classification algorithms
STEP 3:
Real-time classifier test
Connecting to the Twitter streaming API
STEP 1. Creation of a corpus of scientific texts in Spanish
1.1 Acquisition of the Data
• Downloading data from Twitter
• Creating an app
• Data obtained
• Script for data download
Characteristics of the total dataset
Language: Spanish
Tweets downloaded via the Streaming API: 171,459
Tweets downloaded via the REST API: 37,292
Total downloaded tweets: 208,751
STEP 1. Creation of a corpus of scientific texts in Spanish
1.2 Preprocessing of the data:
• Store the tweets as CSV text.
• UTF / ANSI formats
• Spanish language
• Texts in lowercase
• Retweets
• Suppression of possible duplicates with R
• Tokenization
• Other preprocessing
• Manual classification of the sentiment of the text
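The preprocessing steps above can be sketched in a few lines. This is a minimal illustration, not the authors' script: the slides report removing duplicates with R and using Python 2.7, whereas this sketch uses Python 3 throughout. The function name `preprocess`, the token pattern, and the choice to drop retweets by looking for a leading "rt @" are all assumptions made for the example.

```python
import re

# Assumed token pattern: words, optionally prefixed by # (hashtag) or @ (mention).
TOKEN_RE = re.compile(r"[#@]?\w+")

def preprocess(tweets):
    """Lowercase, drop retweets and exact duplicates, then tokenize."""
    seen = set()
    cleaned = []
    for text in tweets:
        text = text.strip().lower()
        if text.startswith("rt @"):  # assumed retweet marker
            continue
        if text in seen:             # exact duplicate after lowercasing
            continue
        seen.add(text)
        cleaned.append(TOKEN_RE.findall(text))
    return cleaned

# Hypothetical sample tweets, invented for the example.
sample = [
    "La ciencia avanza #ciencia",
    "RT @usuario: La ciencia avanza #ciencia",
    "la ciencia avanza #ciencia",
    "Nuevo estudio sobre el clima",
]
tokens = preprocess(sample)
```

Only two of the four sample tweets survive: the retweet is dropped and the third tweet is a duplicate of the first once lowercased.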
STEP 1. Creation of a corpus of scientific texts in Spanish
Corpus of texts:
10,000 elements
• 5,000 messages labeled as positive
• 5,000 messages labeled as negative
STEP 2. Supervised machine learning model
Learning: The classifier is trained with the corpus of positive and negative scientific
tweets in Spanish: Training 70% - Test/validation 30%
6 Algorithms used:
– Original Naive Bayes,
– Multinomial Naive Bayes,
– Multivariate Bernoulli Naive Bayes,
– Logistic Regression,
– Linear Support Vector Classification (SVC) and
– Linear classifiers with stochastic gradient descent (SGD) training.
Combination of classification algorithms: voting by feature intervals.
A voting system is created where each algorithm has one vote and the classification that
has the most votes is the one chosen.
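The 70/30 split and the one-vote-per-algorithm rule can be sketched as follows. This is an illustrative sketch in plain Python 3, not the OPScience code: the names `split_70_30` and `majority_vote`, the fixed random seed, and the tie-breaking behavior are assumptions, and the votes are given directly as labels rather than produced by trained classifiers.

```python
import random
from collections import Counter

def split_70_30(items, seed=0):
    """Shuffle a copy of the items and split it 70% train / 30% test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]

def majority_vote(votes):
    """Return the label with the most votes.

    Ties resolve to the label encountered first (Counter ordering in Python 3.7+).
    """
    return Counter(votes).most_common(1)[0][0]

# A corpus of 10,000 items yields 7,000 training and 3,000 test items.
train, test = split_70_30(range(10000))

# Hypothetical votes from six classifiers for one tweet.
label = majority_vote(["pos", "pos", "neg", "pos", "neg", "pos"])
```

With an even number of voters a 3-3 tie is possible, so a real system needs an explicit tie-breaking rule; the slides do not say which one was used.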
STEP 3. Real-time classifier test
Validation of the Model
• Using these predictive models, the classifier connects to the Twitter data stream in real time (using the available Streaming API),
• filters tweets written in Spanish about science by keyword or hashtag, in order to predict
the sentiment of each tweet generated,
• and automatically visualizes, with the Matplotlib library, those classified with high
confidence (> 0.80).
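The slides do not define how the confidence value is obtained. One common convention, assumed here, is the share of classifiers that agree with the winning label. The sketch below (Python 3) applies the 0.80 threshold under that assumption; `vote_with_confidence`, `high_confidence`, and the sample stream are hypothetical, and no Twitter API call is shown.

```python
from collections import Counter

def vote_with_confidence(votes):
    """Return (winning label, fraction of votes that agree with it)."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)

def high_confidence(stream, threshold=0.80):
    """Yield (text, label, confidence) only when confidence exceeds the threshold."""
    for text, votes in stream:
        label, conf = vote_with_confidence(votes)
        if conf > threshold:
            yield text, label, conf

# Hypothetical stream: each item is a tweet plus the six classifier votes.
stream = [
    ("tweet a", ["pos", "pos", "pos", "pos", "pos", "neg"]),  # 5 of 6 agree
    ("tweet b", ["pos", "pos", "neg", "pos", "neg", "pos"]),  # 4 of 6 agree
    ("tweet c", ["neg"] * 6),                                 # unanimous
]
kept = list(high_confidence(stream))
```

Under this definition and with six voters, a threshold of 0.80 keeps only tweets on which at least five classifiers agree.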
Results
Classifier Results
Accuracy = correct predictions / total predictions
Typical accuracy for this type of model: about 70%
Example: the TASS project is around 72% (Cumbreras et al., 2016).
Algorithm                        Accuracy (%)
Original Naive Bayes             72.64
MNB_classifier                   72.24
BernoulliNB_classifier           72.80
LogisticRegression_classifier    71.88
LinearSVC_classifier             70.45
SGDClassifier                    71.15
Combination of classifiers
voted_classifier: Accuracy 72.31%
Confusion Matrix
              Predicted Pos   Predicted Neg
Actual Pos    TP              FN
Actual Neg    FP              TN

              Predicted Pos   Predicted Neg
Actual Pos    1158            342
Actual Neg    465             1047
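The standard evaluation metrics can be derived directly from the four counts in the matrix, reading the off-diagonal cells as 342 false negatives and 465 false positives. The sketch below (Python 3) is an illustration of those formulas; the slides themselves report only accuracy, so precision, recall and F1 are additions for the example.

```python
# Counts taken from the confusion matrix above (positive class = "Pos").
tp, fn = 1158, 342
fp, tn = 465, 1047

total = tp + fn + fp + tn
accuracy = (tp + tn) / total        # correct predictions / total predictions
precision = tp / (tp + fp)          # share of predicted positives that are right
recall = tp / (tp + fn)             # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"total={total} accuracy={accuracy:.4f} "
      f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

With these counts the accuracy evaluates to roughly 73.2%, close to but not identical to the 72.31% reported for the voted classifier; the slides do not say whether both figures come from the same run.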
Conclusions
• Microblogging and Twitter as a communication tool for science.
• Preparation of a specific corpus of scientific texts in Spanish.
• Training of a model: algorithms and parameters used.
• Evaluation of the results obtained: accuracy of about 72%.
• Test in real time.
Future lines of research
• This study can support the strategies of scientific communication.
• Test and study of individual results of the classification algorithms.
• Enlargement of the corpus and labeling with more classes: positive,
negative and neutral to include the informative messages.
• Measurement of the models at the end of each preprocessing phase, in
order to assess their relative importance.
• Real-time, large-scale studies with distributed computing.
Future lines of research (continued)
Continue with RQ1 (Can we analyze a part of the public data available in the social network Twitter to
know attitudes, opinions and sentiments towards the communication topics of science that
are shared?) and move towards the prediction of future trends in science topics.
Patricia Sánchez-Holgado
Carlos Arcila-Calderón

Towards the study of sentiment in the public opinion of science in Spanish
