DETECTING AND VERIFYING ONLINE DISINFORMATION:
HOW NLP AND DATA ANALYSIS CAN HELP
Carolina Scarton
c.scarton@sheffield.ac.uk
carolscarton
2º Workshop em Data Science
online workshop, 08/02/2021
WHY WEVERIFY?
Credit: Zlatina Marinova (Ontotext)
WEVERIFY ARCHITECTURE
https://weverify.eu/
TRULY MEDIA: COLLABORATIVE CROSS-MODAL
VERIFICATION WORKBENCH
Credit: Zlatina Marinova (Ontotext)
VERIFICATION BROWSER PLUGIN
Available for Chrome, Firefox and Opera
USERS:
AND MANY MORE ...
Credit: Zlatina Marinova (Ontotext)
VISUAL EXPLORATION OF DISINFORMATION NETWORKS
Credit: Zlatina Marinova (Ontotext)
BLOCKCHAIN DATABASE OF KNOWN FAKES
Credit: Denis Teyssou (AFP)
VERACITY AND STANCE ANALYSIS OF ONLINE CONVERSATIONS
https://tweetveracity.gate.ac.uk/
Year 1: Feature-based approach
➢ Textual + Twitter metadata information
➢ Traditional ML models
➢ Extended in WeVerify
macro-F1: 0.486
Year 2: Feature-based approach
➢ Textual + Twitter metadata information
➢ Traditional ML models
➢ Imbalanced data treatment
➢ New approach in WeVerify
macro-F1: 0.484
Year 2: BERT-based approach
➢ Deep-learning approach → text only
➢ Imbalanced data treatment
➢ Platform independent
➢ New approach in WeVerify
macro-F1: 0.513
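The scores above are macro-F1, which averages per-class F1 with equal weight for every class and is therefore sensitive to how well rare classes are handled. The following is a minimal sketch in plain Python; the label names and toy predictions are hypothetical and are not the WeVerify data or models.

```python
from collections import Counter

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all classes seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[g] += 1  # gold class g was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy labels: one frequent class and two rare ones.
gold = ["comment"] * 8 + ["support", "deny"]
pred = ["comment"] * 10  # a majority-class baseline
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(f"accuracy={accuracy:.2f} macro-F1={macro_f1(gold, pred):.3f}")
```

On this toy input the majority-class baseline reaches an accuracy of 0.80 but a macro-F1 of only 0.296, which illustrates why macro-F1 is a stricter measure when the data is imbalanced.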
IMAGE/MEME OCR
➢ The advert makes a highly misleading claim, but the claim is contained only in the image
○ it’s six in the next parliament, and none is a completely new hospital
○ https://fullfact.org/election-2019/ads/
➢ There were between 800k and 900k impressions of adverts that used this image
https://www.facebook.com/ads/library/?active_status=all&ad_type=political_and_issue_ads&country=GB&view_all_page_id=8807334278
Credit: Mark Greenwood (Sheffield)
COVID-19 CLASSIFIER
➢ Classification of misinformation into 10 different categories (Brennen et al., 2020):
• Public authority
• Community spread and impact
• Medical advice, self-treatments and virus effects
• Prominent actors
• Conspiracy theory
• Virus transmission
• Virus origins and properties
• Public preparedness
• Vaccines, medical treatment, and tests
• Cannot determine
Scott Brennen, Felix Simon, Philip Howard, and Rasmus Kleis Nielsen. 2020. Types, sources, and claims of COVID-19 misinformation. Technical report, Reuters Institute.
COVID-19 CLASSIFIER
➢ IFCN dataset:
• Claim → rephrased by a fact-checker
• Explanation → why a claim is false
• Source link → page of the debunk
• Date → of publication on the IFCN Poynter website
• Type of media
• Original website
https://www.poynter.org/ifcn-covid-19-misinformation/
COVID-19 CLASSIFIER
➢ Data annotation:
• 27 volunteers (English claims only)
• Assign most relevant class + a confidence score
• After removing low-quality annotations → 1,293 debunks annotated (Cohen's κ = 0.70)
Credit: Xingyi Song (Sheffield)
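Cohen's κ corrects raw agreement for the agreement expected by chance. The sketch below computes it for two annotators in plain Python; the annotations are hypothetical, and how the reported 0.70 was aggregated across the 27 volunteers is not stated in the slides, so this only illustrates the pairwise statistic.

```python
from collections import Counter

def cohens_kappa(ann_a, ann_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(ann_a) != len(ann_b) or not ann_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(ann_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    freq_a, freq_b = Counter(ann_a), Counter(ann_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)

# Hypothetical annotations using category names from the classification scheme.
a = ["Conspiracy theory", "Virus transmission", "Public authority", "Conspiracy theory"]
b = ["Conspiracy theory", "Virus transmission", "Prominent actors", "Conspiracy theory"]
print(round(cohens_kappa(a, b), 3))
```

Here the annotators agree on 3 of 4 items (0.75 observed), chance agreement is 5/16, and κ comes out at about 0.636.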
COVID-19 CLASSIFIER
➢ Large-scale analysis:
• Starts with Conspiracy theories
• Then Community spread and Virus origin in mid-February
• Public authority action dominates in March
• Community spread and Prominent actors are the top topics in the later period
Credit: Xingyi Song (Sheffield)
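A trend like the one above can be read off by counting classifier outputs per month. The following is a rough sketch of that aggregation, assuming each debunk is reduced to a (publication date, predicted category) pair; the records are hypothetical and are not the real IFCN data.

```python
from collections import Counter, defaultdict
from datetime import date

def top_category_by_month(debunks):
    """Map 'YYYY-MM' to the most frequent category among that month's debunks."""
    counts = defaultdict(Counter)
    for published, category in debunks:
        counts[published.strftime("%Y-%m")][category] += 1
    return {month: c.most_common(1)[0][0] for month, c in sorted(counts.items())}

# Hypothetical (date, predicted category) pairs, invented for illustration.
debunks = [
    (date(2020, 1, 28), "Conspiracy theory"),
    (date(2020, 1, 30), "Conspiracy theory"),
    (date(2020, 1, 31), "Virus transmission"),
    (date(2020, 3, 10), "Public authority"),
    (date(2020, 3, 12), "Public authority"),
    (date(2020, 3, 15), "Community spread and impact"),
]
print(top_category_by_month(debunks))
```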
COVID-19 CLASSIFIER
➢ Large-scale analysis:
• 50% text
• ~ 50% of Virus origin and Public preparedness → videos
Credit: Xingyi Song (Sheffield)
COVID-19 CLASSIFIER
➢ Large-scale analysis:
• Most spread on social media → Public authority action and Community spread
• Instagram and TikTok → Virus origin
• News, YouTube, blogs and TV → Conspiracy theory
• Messaging apps → General medical advice
Credit: Xingyi Song (Sheffield)
COVID-19 CLASSIFIER
➢ Large-scale analysis:
• Community spread → highest in misleading claims
• ~50% of No evidence → General advice
Credit: Xingyi Song (Sheffield)
ONGOING AND FUTURE WORK
COVID-19 SOCIAL MEDIA ANALYSIS
➢ Matching debunks (IFCN dataset) with tweets → finding misinformation
"[The coronavirus is] ‘new’ yet it was lab-created and patented in 2015"
https://www.poynter.org/?ifcn_misinformation=the-coronavirus-is-new-yet-it-was-lab-created-and-patented-in-2015
COVID-19 SOCIAL MEDIA ANALYSIS
➢ Matching debunks (IFCN dataset) with tweets → finding debunks
"An alkaline diet could prevent COVID-19"
https://www.poynter.org/?ifcn_misinformation=alkaline-diet-could-prevent-covid-19
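The slides do not say how debunks are matched with tweets. As a stand-in, the sketch below uses a simple bag-of-words cosine similarity, a common lexical baseline, to rank tweets against a debunked claim. The tweets, the tokeniser, and the 0.3 threshold are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lowercase word tokens; keeps digits and hyphens so 'covid-19' survives."""
    return re.findall(r"[a-z0-9][a-z0-9-]*", text.lower())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def match_tweets(claim, tweets, threshold=0.3):
    """Return (score, tweet) pairs at or above the threshold, best first."""
    claim_vec = Counter(tokens(claim))
    scored = [(cosine(claim_vec, Counter(tokens(t))), t) for t in tweets]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)

claim = "An alkaline diet could prevent COVID-19"  # debunked claim from the slide
tweets = [  # hypothetical tweets, invented for illustration
    "Eating an alkaline diet will prevent covid-19, pass it on",
    "Fact check: an alkaline diet does not prevent COVID-19",
    "Lovely weather in Sheffield today",
]
for score, tweet in match_tweets(claim, tweets):
    print(f"{score:.2f}  {tweet}")
```

In this toy run both the tweet repeating the claim and the tweet debunking it pass the threshold, so a purely lexical match retrieves misinformation and debunks alike and cannot tell them apart on its own.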
DIGITAL COMPANION ASSISTANT
MULTILINGUAL MISINFORMATION ANALYSIS
A diet rich in alkaline foods can eliminate the coronavirus
2020/03/01: Spain https://maldita.es/malditaciencia/2020/03/01/cuerpo-alcalino-no-previene-coronavirus-tec-monterrey/ (WhatsApp)
2020/04/02: Indonesia https://cekfakta.tempo.co/fakta/715/fakta-atau-hoaks-benarkah-virus-corona-bisa-dibunuh-dengan-konsumsi-makanan-alkali (WhatsApp)
2020/04/02: US https://leadstories.com/hoax-alert/2020/04/Fact-Check-Alkaline-Diet-Does-NOT-Prevent-You-From-Getting-Coronavirus.html (Facebook)
2020/04/06: Venezuela https://efectococuyo.com/cocuyo-chequea/la-dieta-alcalina-ayuda-a-prevenir-la-covid-19/ (WhatsApp)
2020/06/16: Spain https://www.efe.com/efe/espana/efeverifica/consumir-alimentos-alcalinos-no-protege-del-coronavirus/50001435-4272618 (Facebook and Twitter)
2020/09/12: Turkey https://teyit.org/yeni-koronavirusun-alkali-beslenerek-yok-edilebilecegi-iddiasi (Facebook)
2020/12/09: Brazil https://www.aosfatos.org/noticias/dieta-rica-em-alimentos-alcalinos-nao-e-capaz-de-eliminar-o-coronavirus/ (Facebook)
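The debunks listed for this one claim can be summarised with a few lines of Python. The sketch below takes the dates, countries, and platforms as they appear in the list and reports how long the claim kept resurfacing and where; the tuple layout and the function name are choices made for this illustration.

```python
from collections import Counter
from datetime import date

# (debunk date, country, platforms) as listed for the alkaline-diet claim.
debunks = [
    (date(2020, 3, 1), "Spain", ["WhatsApp"]),
    (date(2020, 4, 2), "Indonesia", ["WhatsApp"]),
    (date(2020, 4, 2), "US", ["Facebook"]),
    (date(2020, 4, 6), "Venezuela", ["WhatsApp"]),
    (date(2020, 6, 16), "Spain", ["Facebook", "Twitter"]),
    (date(2020, 9, 12), "Turkey", ["Facebook"]),
    (date(2020, 12, 9), "Brazil", ["Facebook"]),
]

def summarise(debunks):
    """Return the span in days, distinct countries, and per-platform counts."""
    dates = [d for d, _, _ in debunks]
    span_days = (max(dates) - min(dates)).days
    countries = sorted({c for _, c, _ in debunks})
    platforms = Counter(p for _, _, ps in debunks for p in ps)
    return span_days, countries, platforms

span_days, countries, platforms = summarise(debunks)
print(span_days, len(countries), dict(platforms))
```

For these seven debunks the first and last are 283 days apart, they cover six countries, and Facebook is named four times against three for WhatsApp and one for Twitter.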
GATECLOUD WEVERIFY SERVICES
THANK YOU FOR YOUR ATTENTION!
www.weverify.eu
@WeVerify
