Beyond Fact Checking — Modelling Information Change in Scientific Communication
The document discusses modelling information change in scientific communication. It begins by noting how science is often communicated through journalists to the public, and how the message can change and become exaggerated or misleading along the way. It then discusses developing models to detect exaggeration by predicting the strength of causal claims, such as distinguishing between correlational and causal language. Pattern exploiting training is explored as a way to leverage large language models for this task in a semi-supervised manner. Finally, it proposes generally modelling information change by comparing original research to how it is communicated elsewhere, such as in news articles and tweets, using semantic matching techniques. Experiments are discussed on newly created datasets to benchmark performance of models on this task.
1.
Beyond Fact Checking — Modelling Information Change in Scientific Communication
Isabelle Augenstein*
AAAI
11 February 2023
*credit for some slides: Dustin Wright
Scientists → Journalists → The Public
2.
How science is communicated matters
I can still do that
HIV Vaccine may raise risk
Never!
Scientists have found that
HIV vaccine has many side
effects!
Affects trust in science and
future actions
Kuru et al. (2019); Gustafson and Rice (2019); Fischhoff (2012); Morton (2010)
https://www.nature.com/articles/450325a
The public relies on journalists to learn about scientific findings
The public perception of science
is largely shaped by how
journalists present science
instead of science itself.
5.
… despite seeing substantial issues with how science is reported
The lack of domain-specific
scientific knowledge makes it
difficult to critically evaluate
science news coverage.
6.
Skewed reporting of science undermines trust in science
Hyped-up polarised news
articles (”caffeine causes
cancer” / ”coffee cures cancer”)
lead to uncertainty and erosion
of trust in scientists
Schoenfeld and Ioannidis: ”Is everything we eat associated with cancer?
A systematic cookbook review”, American Journal of Clinical Nutrition,
2013. https://pubmed.ncbi.nlm.nih.gov/23193004/
https://www.vox.com/science-and-health/2019/6/11/18652225/hype-science-press-releases
7.
It’s easy for the message to change
Fang et al. (2016): "Increasing dietary magnesium intake is associated with a reduced risk of stroke, heart failure, diabetes, and all-cause mortality."
Reuters (2016): "The study findings suggest that increased consumption of magnesium-rich foods may have health benefits."
Twitter: "#Magnesium saves lives"
The message isn’t necessarily false, but it can be misleading and
inaccurate and lead to behavior change
12.
Modelling Information Change – Automatic Fact Checking
● Claim Check-Worthiness Detection: "Magnesium saves lives" → {check-worthy, not check-worthy}
● Evidence Document Retrieval and Ranking: retrieve and rank evidence documents for "Magnesium saves lives"
● Recognising Textual Entailment: ("Magnesium saves lives", "Increasing dietary magnesium intake is associated with a reduced risk of stroke, heart failure, diabetes, and all-cause mortality") → {positive, negative, neutral}
● Veracity Prediction: "Magnesium saves lives" → {true, false, not enough info}
13.
Evidence Ranking for Automatic Fact Checking
Evidence Document Retrieval and Ranking
Claim: "Magnesium saves lives"
Evidence: "The study findings suggest that increasing dietary magnesium intake is associated with a reduced risk of stroke, heart failure, diabetes, and all-cause mortality"
● Notion of similarity matters
○ Strict textual similarity (most prior work)
○ Similarity of information content (proposed here)
● Domain differences increase task difficulty
○ Measure similarity between <claim, evidence> from <news, news> (most prior work)
○ Measure similarity between <claim, evidence> from <news, press release/twitter>
(proposed here)
14.
Overview of Today’s Talk
● Introduction
○ The Life Cycle of Science Communication
● Part 1: Exaggeration Detection
○ Measuring differences in stated causal relationships
○ Experiments with health science press releases
● Part 2: Modelling Information Change
○ Modelling information change in communicating scientific findings more broadly
○ Experiments with press releases and tweets in different scientific domains
● Outlook and Conclusion
○ Future research challenges
15.
Exaggeration Detection of Science Press Releases
Fang et al. (2016): "Increasing dietary magnesium intake is associated with a reduced risk of stroke, heart failure, diabetes, and all-cause mortality."
Reuters (2016): "The study findings suggest that increased consumption of magnesium-rich foods may have health benefits."
Problem: the strength of the claim changes from a correlational statement in the paper ("associated with") to conditionally causal in the news ("suggest", "may")
16.
Exaggeration in Science Journalism
Sumner et al. (2014) and Bratton et al. (2019): InSciOut
Sumner, P., Vivian-Griffiths, S., Boivin, J., Williams, A., Venetis, C. A., Davies, A., ... & Chambers, C. D. (2014). The association between exaggeration in health related science
news and academic press releases: retrospective observational study. Bmj, 349.
Bratton, L., Adams, R. C., Challenger, A., Boivin, J., Bott, L., Chambers, C. D., & Sumner, P. (2019). The association between exaggeration in health-related science news and
academic press releases: a replication study. Wellcome open research, 4.
Objective: To identify the source (press releases or news) of distortions,
exaggerations, or changes to the main conclusions drawn from research that could
potentially influence a reader’s health related behaviour.
Conclusions:
• 33% of press releases contain exaggerations of conclusions of scientific papers
• Exaggeration in news is strongly associated with exaggeration in press releases
17.
Modelling Information Change – Causal Claim Strength Prediction
Labels (with typical language cues):
0 No Relation
1 Correlational: association, associated with, predictor, at high risk of
2 Conditional causal: increase, decrease, lead to, effect on, contribute to, result in (cues indicating doubt: may, might, appear to, probably)
3 Direct causal: increase, decrease, lead to, effective on, contribute to, reduce, can
Li et al. ”An NLP Analysis of Exaggerated Claims in Science News.” In NLPmJ@EMNLP, 2017.
Yu et al. ”Measuring Correlation-to-Causation Exaggeration in Press Releases”. In Coling 2020.
18.
Our Work on Exaggeration Detection in Science
Formalize the task of scientific exaggeration detection:
predicting when a press release exaggerates a scientific paper
Curate a dataset from expert annotations to benchmark performance
Input: primary finding of the paper as written in the abstract and the press release
Investigate and develop methods for automatic scientific exaggeration detection
Semi-supervised method based on Pattern Exploiting Training (PET)
Wright et al. ”Semi-Supervised Exaggeration Detection of Health Science Press Releases”. In EMNLP 2021.
https://aclanthology.org/2021.emnlp-main.845/
19.
Task Formulations
Labels (with typical language cues):
0 No Relation
1 Correlational: association, associated with, predictor, at high risk of
2 Conditional causal: increase, decrease, lead to, effect on, contribute to, result in (cues indicating doubt: may, might, appear to, probably)
3 Direct causal: increase, decrease, lead to, effective on, contribute to, reduce, can
Li et al. ”An NLP Analysis of Exaggerated Claims in Science News.” In NLPmJ@EMNLP, 2017.
Exaggeration detection (T1)
• Entailment-like task
• Paired (press release, abstract) data
• ℒ_T1 = {0 Downplays, 1 Same, 2 Exaggerates}

Causal claim strength prediction (T2)
• Text classification task
• Unpaired press releases and abstracts
• Final prediction compares the strength of the paired press release and abstract (see the sketch below)
• ℒ_T2 = {0 No Relation, 1 Correlational, 2 Conditional Causal, 3 Direct Causal}
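The speaker notes mention that T2 predictions can be converted into a T1 label at decoding time by comparing the predicted strengths of the paired press release and abstract. A minimal sketch of that decoding step (function and label names are illustrative, not taken from the released code):

```python
# Illustrative decoding step: compare predicted claim strengths (0-3) of the
# paired press release and abstract to obtain the exaggeration label.
# Function and variable names are hypothetical, not from the paper's code.
def exaggeration_label(press_release_strength: int, abstract_strength: int) -> str:
    if press_release_strength > abstract_strength:
        return "exaggerates"  # press release makes a stronger causal claim than the paper
    if press_release_strength < abstract_strength:
        return "downplays"    # press release makes a weaker claim than the paper
    return "same"

# Example: abstract is correlational (1), press release is direct causal (3)
print(exaggeration_label(3, 1))  # -> "exaggerates"
```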
20.
Pattern Exploiting Training (Schick et al. 2020)
Traditional classifier: "Eating chocolate causes happiness" → classifier C → probability distribution over labels 0, 1, 2, 3 (e.g. 0.01, 0.21, 0.15, 0.63)
PET: "Eating chocolate causes happiness. The claim strength is [MASK]" → large pretrained language model ℳ → probability distribution over verbalizer tokens (e.g. 0.01, 0.21, 0.15, 0.63)
Pattern: transform the input to a cloze-style question
Verbalizer: predict tokens from the language model which reflect the data’s labels
Ensemble and distillation: masked LMs ℳ0, ℳ1, ℳ2, trained with patterns P0, P1, P2, are applied to the unlabelled data U to produce soft labels; a final classifier C is trained on the soft-labelled data with a KL-divergence loss
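To make the pattern and verbalizer idea concrete, here is a minimal sketch of scoring a single pattern-verbalizer pair with an off-the-shelf masked language model; the model name and verbalizer tokens are illustrative assumptions, and full PET additionally fine-tunes an ensemble of such models and distils them as described above:

```python
# Minimal sketch of one PET pattern-verbalizer pair for claim strength.
# Model name and verbalizer tokens are illustrative choices, not those of the paper.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

def pattern(finding: str) -> str:
    # Pattern: rewrite the input as a cloze-style question with one masked slot
    return f"{finding} The claim strength is {tokenizer.mask_token}."

# Verbalizer: one natural-language token per label (0-3); first sub-token is used
verbalizer = {0: " unrelated", 1: " correlated", 2: " possible", 3: " causal"}

def claim_strength_probs(finding: str) -> list:
    inputs = tokenizer(pattern(finding), return_tensors="pt")
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]  # scores over the vocabulary
    ids = [tokenizer.encode(verbalizer[label], add_special_tokens=False)[0]
           for label in sorted(verbalizer)]
    return torch.softmax(logits[ids], dim=0).tolist()  # normalise over verbalizer tokens

print(claim_strength_probs("Eating chocolate causes happiness."))
```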
MT-PET for Exaggeration Detection
P_T1(0): Scientists claim s. Reporters claim t. The reporters claims are [MASK]
P_T2(0): [Scientists|Reporters] say [s|t]. The claim strength is [MASK]
P_T1(1): Academic literature claims s. Popular media claims t. The media claims are [MASK]
P_T2(1): [Academic literature|Popular media] says [s|t]. The claim strength is [MASK]
Our tasks are T1 (exaggeration prediction) and T2 (claim strength prediction)
We develop patterns by hand and verbalizers semi-automatically using PETAL (Schick et al. 2020)
s and t are the claim text in the abstract and press release, respectively
T1 (Exaggeration Detection) with MT-PET
Method: P / R / F1
Supervised: 28.06 / 33.1 / 29.05
PET: 41.9 / 39.87 / 39.12
MT-PET: 47.8 / 47.99 / 47.35
Substantial improvements when using PET (10 points)
Further improvements with MT-PET (8 points)
Demonstrates transfer of knowledge from claim strength prediction to exaggeration prediction
25.
Learning Dynamics for T2 (Claim Strength Prediction)
MT-PET with 200 samples approaches the performance of vanilla PET with 500 samples
MT-PET with 200 samples approaches the performance of supervised learning with 4,500 samples
PET always outperforms supervised learning
26.
Overview of Today’s Talk
● Introduction
○ The Life Cycle of Science Communication
● Part 1: Exaggeration Detection
○ Measuring differences in stated causal relationships
○ Experiments with health science press releases
● Part 2: Modelling Information Change
○ Modelling information change in communicating scientific findings more broadly
○ Experiments with press releases and tweets in different scientific domains
27.
Modelling Information Change in Scientific Communication
Fang et al. (2016): "Increasing dietary magnesium intake is associated with a reduced risk of stroke, heart failure, diabetes, and all-cause mortality."
Reuters (2016): "The study findings suggest that increased consumption of magnesium-rich foods may have health benefits."
Twitter: "#Magnesium saves lives"
Problem: the message isn’t necessarily false, but it can be misleading
and inaccurate and lead to behavior change
28.
Proposal: General Model of Information Change for SciComm
"The study findings suggest that increased consumption of magnesium-rich foods may have health benefits."
"Increasing dietary magnesium intake is associated with a reduced risk of stroke, heart failure, diabetes, and all-cause mortality."
"In California, drone delivery of a small package would result in about 0.42 kg of greenhouse gas emissions."
Example scores: 4.09 for the matched magnesium findings, 1.14 for the pair with the unrelated drone finding
Wright et al. ”Modeling Information Change in Science Communication with Semantically
Matched Paraphrases”. In EMNLP 2022. https://aclanthology.org/2022.emnlp-main.117/
29.
Information Matching Score (IMS)
A 5-point scale over matched findings, from 1 (completely different / substantial change) to 5 (completely the same / no change)
30.
Data
News + paper processing
● Abstract parser: RoBERTa fine-tuned on PubMed abstracts (F1 > 0.9), segmenting abstracts into Background, Objective, Methods, Results, Conclusion
● Collection sizes: 17,668; 41,388; 733,755
● 45.7M potential <news,paper> pairs, 35.6M potential <tweet,paper> pairs
● Pre-selection with Sentence-BERT (Reimers and Gurevych, 2019) similarity scores in [0, 1], followed by a bucketed sample (see the sketch below)
● 2,400 <news,paper> pairs and 1,200 <tweet,paper> pairs selected for annotation
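A rough sketch of the SBERT pre-selection and bucketed sampling step described in the speaker notes; the model name, bucket width, and per-bucket sample size are assumptions for illustration:

```python
# Sketch of SBERT-based pre-selection with bucketed sampling over similarity scores.
# Model name, bucket width (0.05) and per-bucket sample size are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer, util

sbert = SentenceTransformer("all-MiniLM-L6-v2")

def bucketed_sample(paper_sents, news_sents, per_bucket=5, width=0.05, seed=0):
    rng = np.random.default_rng(seed)
    sims = util.cos_sim(sbert.encode(paper_sents, convert_to_tensor=True),
                        sbert.encode(news_sents, convert_to_tensor=True)).cpu().numpy()
    # All candidate <paper sentence, news sentence> pairs with their similarity
    pairs = [(i, j, float(sims[i, j]))
             for i in range(sims.shape[0]) for j in range(sims.shape[1])]
    sample = []
    for lo in np.arange(0.0, 1.0, width):  # sample evenly from 0.05-wide buckets
        bucket = [p for p in pairs if lo <= p[2] < lo + width]
        if bucket:
            picks = rng.choice(len(bucket), min(per_bucket, len(bucket)), replace=False)
            sample.extend(bucket[k] for k in picks)
    return sample
```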
Semantic Paraphrase and Information Change Dataset (SPICED)
Four scientific domains: Computer Science, Medicine, Biology, Psychology
● 2,400 annotated <news,paper> pairs and 1,200 annotated <tweet,paper> pairs
● 1,200 easy <news,paper> pairs and 1,200 easy <tweet,paper> pairs: matched and unmatched pairs selected based on similarity
● In total: 3,600 annotated pairs and 2,400 easy pairs
33.
SPICED vs. Semantic Textual Similarity
News: "Beckley, who is in the department of psychology and neuroscience at Duke, said that the adult-onset group had a history of anti-social behavior back to childhood, but reported committing relatively fewer crimes."
Paper: "Our results showed that most of the adult onset men began their antisocial activities during early childhood."
STS: 0.38 (max 1); 🌶 SPICED: 4.4 (max 5)
34.
SPICED vs. other sentence matching tasks
Measure the average normalised edit distance across the training set for matching sentences, i.e. d(s, t) / max(|s|, |t|) averaged over matched pairs, where d is the edit distance
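A small sketch of computing this statistic; the Levenshtein implementation is generic, and normalisation by the longer string's length is an assumption consistent with the description above:

```python
# Average normalised edit distance over matched finding pairs.
# Levenshtein distance implemented directly; normalisation by the longer string's
# length is assumed from "normalised edit distance" above.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def avg_normalised_edit_distance(pairs):
    # pairs: iterable of (source finding, reported finding) strings
    pairs = list(pairs)
    return sum(levenshtein(s, t) / max(len(s), len(t)) for s, t in pairs) / len(pairs)
```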
Benchmarking: zero-shot transfer
● Paraphrase Detection: RoBERTa fine-tuned on the adversarial paraphrases of Nighojkar and Licato (2021)
● NLI: RoBERTa fine-tuned on SNLI, MNLI, FEVER, and ANLI
● MiniLM: Sentence-BERT based on MiniLM (Wang et al. 2020)
● MPNet: SBERT based on MPNet (Song et al. 2020)
⛄ Both SBERT models are pre-trained on a corpus of >1B sentence pairs using contrastive learning
37.
Benchmarking: fine-tuning
● RoBERTa, SciBERT, CiteBERT: BERT-family models pretrained on general-domain text (RoBERTa) and scientific text (SciBERT, CiteBERT), fine-tuned on SPICED by minimizing the mean-squared error between the model’s prediction and the ground-truth IMS
● MiniLM-FT, MPNet-FT: SBERT models fine-tuned on SPICED by minimizing the distance between the cosine similarity of the two finding embeddings and the ground-truth IMS (see the sketch below)
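A minimal sketch of scoring a pair of findings with an SBERT model and mapping cosine similarity onto the 1–5 IMS range, following the description in the speaker notes (clip negative similarities to 0, then rescale); the model name is an assumption and the released copenlu models may differ:

```python
# Score a (paper finding, reported finding) pair: cosine similarity of SBERT
# embeddings, clipped to [0, 1] and rescaled to the 1-5 IMS range.
# Model name is an illustrative assumption.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

def predict_ims(paper_finding: str, reported_finding: str) -> float:
    emb = model.encode([paper_finding, reported_finding], convert_to_tensor=True)
    cos = max(util.cos_sim(emb[0], emb[1]).item(), 0.0)  # clip negative similarities
    return 1.0 + 4.0 * cos                               # rescale [0, 1] -> [1, 5]

print(round(predict_ims(
    "Increasing dietary magnesium intake is associated with a reduced risk of stroke.",
    "#Magnesium saves lives"), 2))
```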
38.
Results
Paraphrase/NLI models perform poorly
Best overall is SBERT + fine-tuning
Tweets are harder than news
Potentially much room for improvement (STS tasks see scores in the 90s)
(Pearson correlation, reported overall, for news, and for tweets)
39.
Zero-shot scientific evidence retrieval
Does training on 🌶 SPICED improve performance on scientific evidence retrieval for real-world claims?
Training: SPICED; Testing: CoVERT and COVID-Fact
● CoVERT: 300 claims from Twitter matched with 717 evidence sentences from news articles
● COVID-Fact: 4,086 claims from Reddit matched with 3,219 unique evidence sentences from news articles
RQ1: Do findings reported by different types of outlets express different degrees of information change from their respective papers?
45.
RQ1: Do findings reported by different types of outlets express different degrees of information change from their respective papers?
Linear mixed effect regression model over 1.1M matched <news,paper> pairs
● IV: outlet type (Press Release, Sci&Tech, General outlet)
● Outcome: information matching score
● Controls: fixed effect for subjects, random effect for paper
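A sketch of how such a linear mixed-effects regression could be fit with statsmodels; the column names and data file are hypothetical, and the original analysis additionally restricts the random effect to papers with more than 30 matched pairs:

```python
# Linear mixed-effects model: IMS ~ outlet type + paper subject (fixed effects),
# with a random intercept per paper. Column names are illustrative assumptions.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("matched_news_paper_pairs.csv")  # hypothetical file, one row per pair

model = smf.mixedlm(
    "ims ~ C(outlet_type) + C(subject)",
    data=df,
    groups=df["paper_id"],  # random effect: paper
)
result = model.fit()
print(result.summary())
```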
46.
RQ1: Do findings reported by different types of outlets express different degrees of information change from their respective papers?
YES
Scientific findings covered by Press Releases and SciTech outlets generally show fewer informational changes compared with findings presented in General Outlets
Audience design in journalism (Roland, 2009)
47.
RQ2: Do different types of social media users systematically vary in information change when discussing scientific findings?
Scientists → Journalists → The Public
48.
RQ2: Do different types of social media users systematically vary in information change when discussing scientific findings?
Linear mixed effect regression model over 182K matched <tweet,paper> pairs
● Outcome: information matching score
● Account factors (IV and controls): organizational account, verified, followers, following, account age
● Fixed effect: subjects; random effect: paper
49.
RQ2: Do different types of social media users systematically vary in information change when discussing scientific findings?
YES
Organizational Twitter accounts retain more of the original information from the paper’s findings
50.
RQ2: Do different types of social media users systematically vary in information change when discussing scientific findings?
YES
Organizational Twitter accounts retain more of the original information from the paper’s findings
Verified accounts and accounts with more followers change information more
51.
RQ3: Which parts of a paper are more likely to be miscommunicated by the media?
(Figure: findings from different paper sections, e.g. Abstract, Introduction, Results, are translated by journalists with varying fidelity: good translation, overstating, exaggeration)
52.
Certainty (Pei and Jurgens, 2021): "HIV Vaccine may raise the risk of certain diseases" vs. "Scientists have found that HIV vaccine has many side effects!"
Exaggeration (Wright and Augenstein, 2021): "Our new NLP model performs better than several human baselines" vs. "AI is conquering the world!"
Analyzed over 1.1M matched <news,paper> pairs
RQ3: Which parts of a paper are more likely to be miscommunicated
by the media?
53.
Journalists tend to downplay the certainty and strength of findings in abstracts
(Pei and Jurgens, 2021)
RQ3: Which parts of a paper are more likely to be miscommunicated
by the media?
54.
Compared with findings presented in other sections, and especially in the limitations, the news findings are more likely to be exaggerated and overstated
RQ3: Which parts of a paper are more likely to be miscommunicated
by the media?
55.
Journalists might fail to report the limitations of scientific findings
(Fischhoff, 2012)
RQ3: Which parts of a paper are more likely to be miscommunicated
by the media?
56.
Only studying abstracts is not enough!
RQ3: Which parts of a paper are more likely to be miscommunicated
by the media?
57.
Overview of Today’s Talk
● Introduction
○ The Life Cycle of Science Communication
● Part 1: Exaggeration Detection
○ Measuring differences in stated causal relationships
○ Experiments with health science press releases
● Part 2: Modelling Information Change
○ Modelling information change in communicating scientific findings more broadly
○ Experiments with press releases and tweets in different scientific domains
● Outlook and Conclusion
○ Future research challenges
58.
Major Takeaways
● Careful science communication is important
○ The general public relies on general news outlets for science news
○ Overhyping of science news erodes trust
○ Exaggeration of findings can lead to behaviour change
Twitter: "#Magnesium saves lives"
59.
Major Takeaways
● Proposal: general model of information change
○ Prior work: focus on semantic textual similarity
News: "Beckley, who is in the department of psychology and neuroscience at Duke, said that the adult-onset group had a history of anti-social behavior back to childhood, but reported committing relatively fewer crimes."
Paper: "Our results showed that most of the adult onset men began their antisocial activities during early childhood."
STS: 0.38 (max 1); 🌶 SPICED: 4.4 (max 5)
60.
Major Takeaways
● New task definition, datasets and benchmarking for modelling information change in science communication
○ Diverse benchmark consisting of data from four scientific domains and three textual domains (publications, press releases, tweets)
○ Poor zero-shot performance of models for related tasks (paraphrasing, natural language inference) demonstrates the novelty of the task
○ Downstream improvements for scientific fact checking highlight the task’s importance
Model: copenlu/spiced
Dataset: copenlu/spiced
Code: copenlu/scientific-information-change
PyPI package: pip install scientific-information-change
61.
Major Takeaways
● Opens the door to asking new research questions about broad trends in science communication
YES
Scientific findings covered by Press Releases and SciTech outlets generally show fewer informational changes compared with findings presented in General Outlets
Audience design in
journalism
(Roland, 2009)
RQ1: Do findings reported by different types of outlets express different degrees of information
change from their respective papers?
62.
Future Work
● Information change prediction as an auxiliary task for other downstream scientific NLP tasks
○ E.g. Measuring selective reporting of findings in related work descriptions,
generating faithful summaries of scientific articles
● There is selective reporting of science news – what factors affect
journalists’ selection of scientific findings?
○ E.g. societal relevance, economic implications, entertainment value
● Information is changed in different ways throughout the science communication process – which types of change exist and which are prevalent?
○ Taxonomy of information change needed
References
Dustin Wright, Isabelle Augenstein. Semi-Supervised Exaggeration Detection of Health Science Press Releases. EMNLP 2021.
Paper: https://aclanthology.org/2021.emnlp-main.845/
Code, data and models: https://github.com/copenlu/scientific-exaggeration-detection
Dustin Wright, Jiaxin Pei, David Jurgens, Isabelle Augenstein. Modeling Information Change in
Science Communication with Semantically Matched Paraphrases. EMNLP 2022.
Paper: https://aclanthology.org/2022.emnlp-main.117/
Code, data and models: http://www.copenlu.com/publication/2022_emnlp_wright/
65.
Open positions
1 PhD student, 1 postdoc – explainable fact checking
funded by an ERC starting grant
application deadline: 1 March 2023
start date: Autumn 2023
PhD: https://jobportal.ku.dk/phd/?show=158207
Postdoc: https://jobportal.ku.dk/videnskabelige-stillinger/?show=158206
1 PhD student – fair and accountable NLP
funded by Carlsberg Semper Ardens project
application deadline: 28 February 2023
start date: Autumn 2023
PhD: https://employment.ku.dk/all-vacancies/?show=158390
66.
Thank you! Questions?
Fang et al. (2016): "Increasing dietary magnesium intake is associated with a reduced risk of stroke, heart failure, diabetes, and all-cause mortality."
Reuters (2016): "The study findings suggest that increased consumption of magnesium-rich foods may have health benefits."
Twitter: "#Magnesium saves lives"
Editor's Notes
#3 There’s strong scientific evidence that how science is communicated matters. For example, how a particular vaccine is framed in the media has an impact on vaccine uptake.
#4 This communication is done through process of translation from complex scientific papers written by scientists to accessible news stories written by journalists and finally to public discussions among peer groups.
#5 As a result, the public perception of and trust in science is largely shaped by how journalists present science instead of the science itself.
#6 The public consumes science via general news media, even though they see substantial issues with how science is reported. Additionally, the lack of specialised domain knowledge makes it difficult to critically evaluate science news coverage.
#7 Meta-analysis of common cookbook ingredients, extracted relative risk of cancer
#8 At the same time, it’s very easy for the message of science to change, as demonstrated by this real world example. Multiple aspects of the scientific finding described in Fang et al. are altered here, including
#9 The independent variable being generalized from dietary magnesium to all magnesium
#10 The strength of the claim changing from a correlational statement to conditionally causal in the news and causal on Twitter
#11 And the diseases in the finding being generalized
#12 As a result, the translated claim isn’t necessarily false, but it can potentially be misleading or inaccurate and lead to behavior change. Two options: 1) aim at creating better news coverage through better communication with journalists, an uphill battle, but we can do our bit as scientists; 2) build tools and resources to automatically understand and analyze these changes in information in the context of science communication, in order to broadly understand and improve the science communication process.
#13 Now there is a relationship between modelling information change in scientific communication and automatic fact checking. As just said, a difference in veracity and a change in information in science communication are not exactly the same: just because there is a difference in information content does not mean there has to be a difference in veracity. To illustrate this, let’s look at how automatic fact checking generally works.
#14 Let's have a closer look at the evidence ranking step.
Prior work defines similarity very strictly, looking at the textual similarity, where all words in one sentence have to overlap with all words in the other sentence for them to be ranked highly. We are only concerned with the information contained in the findings and ignore all extraneous information such as “The study findings suggest...”, whereas trained STS models take that into account.
#17 Note: press releases are high-quality, from the universities’ websites, and EurekAlert.org
#20 T2 can be converted to T1 at decoding / testing time
#21 - The two primary components of PET are patterns and verbalizers. Patterns are cloze-style sentences which mask a single token
- Verbalizers are single tokens which capture the meaning of the task’s labels in natural language, and which the model should predict to fill in the masked slots in the provided patterns
- Given a set of pattern-verbalizer pairs (PVPs), an ensemble of models is trained on a small labeled seed dataset to predict the appropriate verbalizations of the labels in the masked slots.
- These models are then applied on unlabeled data, and the raw logits are combined as a weighted average to provide soft-labels for the unlabeled data. - A final classifier is then trained on the soft labeled data using a distillation loss based on KL-divergence.
- ℳ are the masked LMs resulting from using the different patterns, applied to U (the set of unlabelled data) to get soft probability distributions. The final model is a distilled version of this ensemble of masked LMs, trained using a KL-divergence loss between the predictions of the PET model and the target logits. Distillation is part of training the final classifier; the original one-hot vector is not used.
#22 T1 (exaggeration prediction) and T2 (claim strength prediction). The main task is P_m; P_a is the complementary task. At training time, P_a is done for both popular science communication and scientific articles; at test time, two independent predictions can be used to infer the final label. U_M is unlabelled main-task data.
#24 Logits of multiple verbalisers are averaged for the prediction of that class
#28 As a result, the translated claim isn’t necessarily false, but it can potentially be misleading or inaccurate and lead to behavior change. The goal of this work is to build tools and resources to automatically understand and analyze these changes in information in the context of science communication, in order to broadly understand and improve the science communication process.
#29 At a high level, the task is to measure the similarity of the scientific findings described by two scientific sentences. Here, a scientific finding is defined as “a statement that describes a particular research output of a scientific study, which could be a result, conclusion, product, or other research output.” We wish to build models which predict a scalar value for this information similarity, in order to determine which findings are matching and the degree to which the information in matched findings changes.
#30 This lead us to define the information matching score, or IMS, which is a 5-point measure of the similarity of the scientific findings described by two sentences.
#31 We start by matching scientific papers with news articles and Tweets using Altmetric, an aggregator of mentions of scientific articles online
From this pool of data, we extract potential finding sentences from the scientific papers and news articles automatically and take Tweets as is, pairing all extracted sentences between papers and news, and papers and tweets
This yields an unlabelled pool of 45.7M potential (news, paper) pairs and 35.6M (tweet, paper) pairs
To limit our set of data for annotation, we pre-select potential matches using a SentenceBERT model trained on over 1B sentence pairs
To get a range of potentially highly similar and highly dissimilar pairs, we do a bucketed sample based on the predicted similarity by SBERT. In this, we sample evenly from the unlabelled data in 0.05 increment buckets
We thus select a final set of 2,400 (news, paper) pairs and 1,200 (tweet, paper) pairs for annotation, distributed evenly between four scientific fields: medicine, biology, psychology, and computer science
#32 To annotate this data, we use the Potato annotation interface developed by David’s lab at the University of Michigan, and recruit domain experts using the Prolific platform
#33 Finally, after acquiring 5 expert annotations for each pair and filtering for annotator competence using MACE, we supplement the training split of the data with 1,200 very highly similar and very dissimilar pairs as predicted by SBERT, yielding a final dataset of 3,600 annotation pairs and 2,400 supplemented easy pairs. We call the dataset “SPICED”.
#34 To contrast this task with STS, we can see stark differences in the prediction of SBERT for examples in SPICED which have very high similarity.
#35 Additionally, we see much greater lexical changes between matching pairs in SPICED and other tasks such as STS and NLI
#36 We look at multiple settings and models, including zero-shot transfer to SPICED as well as fine-tuning on SPICED
#37 For zero-shot transfer, we experiment with models pretrained on paraphrase detection and natural language inference, as well as SBERT models pretrained on over 1B sentences, including scientific text. MiniLM is a BERT model trained by distilling multiple BERTs into one; SBERT uses siamese BERT networks to obtain sentence embeddings for pairs of sentences, trained to minimize the distance between these two embeddings. MPNet is a language model trained using permuted language modeling.
#38 For fine-tuning, we look at both vanilla BERT family models for both general and scientific text as well as the same SBERT models in the zero-shot transfer setting
#39 In terms of Pearson correlation on a held out test set, we see that
SBERT does well for zero-shot transfer
The best performance is achieved by SBERT fine-tuned on SPICED
Tweets are much harder than news
And that there is potentially much room for improvement considering how well the same models perform on other similarity tasks
#40 The first application I’ll talk about is zero-shot evidence retrieval for real-world scientific claims. Here, the setting is SBERT both with and without fine-tuning on SPICED and with no fine-tuning on the downstream datasets. We look at two datasets: CoVERT, which is sourced from Twitter and whose evidence comes from the news, and COVID-Fact, which is sourced from Reddit and whose evidence also comes from the news. CoVERT [163] is a dataset of scientific claims sourced from Twitter, mostly in the domain of biomedicine. We use the 300 claims and the 717 unique evidence sentences in the corpus in our experiment. COVID-Fact [206] is a semi-automatically curated dataset of claims related to COVID-19 sourced from Reddit. The corpus contains 4,086 claims with 3,219 unique evidence sentences.
#41 Starting with CoVERT, we see that in terms of mean average precision and mean reciprocal rank for retrieving ground truth evidence, the best performing model by a large margin is an SBERT model fine-tuned on SPICED. (SBERT uses siamese BERT networks to obtain sentence embeddings for pairs of sentences, trained to minimize the distance between these two embeddings; MiniLM is a BERT model trained by distilling multiple BERTs into one.)
MiniLM: SBERT with MiniLM as the base network [244]; we obtain sentence embeddings for pairs of findings and measure the cosine similarity between these two embeddings, clip the lowest score to 0, and convert this score to the range [1,5]. Note that this model was trained on over 1B sentence pairs, including from scientific text, using a contrastive learning approach where the embeddings of sentences known to be similar are trained to be closer than the embeddings of negatively sampled sentences. SBERT models represent a very strong baseline on this task, and have been used in the context of other matching tasks for fact checking, including detecting previously fact-checked claims [216]. MPNet: The same setting and training data as MiniLM but with MPNet as the base network [220]. MiniLM-FT: The same MiniLM model from the zero-shot transfer setup but further fine-tuned on SPICED. The training objective is to minimize the distance between the IMS and the cosine similarity of the output embeddings of the pair of findings.
#42 Additionally, we see gains for both SBERT models when fine-tuned
#43 We see a similar story on Covid-Fact, where the best performing model involves fine-tuning
#44 And fine-tuning yields improvements for both SBERT variants. This is encouraging, as it shows that SPICED enables transfer for two models on two different datasets which contain pairs and domains that don’t exist in SPICED, namely (Twitter, news) and (Reddit, news) pairs.
#45 Finally, I’ll describe large scale trends in science communication that our models have allowed us to reveal
First, we ask if the type of news outlet has an effect on information change.
#46 For this, we build a linear mixed effect model to predict the IMS of 1.1M <news, paper> pairs which have been matched by our best performing SBERT model. We consider a pair to be matching if the predicted IMS is above 3. We include fixed effects for type of news outlet (here “Press Release”, “Science and Technology”, and “General News”) as well as the paper subject, and a random effect for each paper with >30 pairs.
#47 We find that the answer to this question is
Yes
Looking at the fixed effects, we find that
Scientific findings covered by Press Release and SciTech generally have fewer informational changes compared with findings presented in General Outlets
#48 Next, we ask if different social media users systematically vary in information change when discussing scientific findings
#49 For this, we build a linear mixed effect model over 182 thousand matched (tweet, paper) pairs using our best model, and include fixed effects for different social factors
#50 Here we again find
Yes
Looking at the fixed effects, we find
Organisational accounts tend to be more faithful
#51 Here we again find
Yes
Looking at the fixed effects, we find
Organisational accounts tend to be more faithful
#52 Finally, we ask which parts of a paper are more likely to be miscommunicated by the media
#53 Here, we further analyse the 1.1M matched findings from our first research question by classifying their degree of certainty and causal claim strength using models from both of our labs’ previous work
#54 We find that journalists tend to downplay the certainty and strength of findings in the abstracts, mirroring previous findings
#55 And that findings as presented in the limitations are more likely to be exaggerated and overstated
#56 This could be explained by known problems in the reporting of limitations by science journalists
#57 Ultimately, our finding also reveals that studying abstracts alone is not enough in the context of science communication
#67 Final message: the way science is communicated affects behaviour. Be careful in your communication with journalists, and be careful what you write on Twitter