Beyond Fact Checking —
Modelling Information Change in
Scientific Communication
Isabelle Augenstein*
AAAI
11 February 2023
*credit for some slides: Dustin Wright
Scientists → Journalists → The Public
How science is communicated matters
News headline: “HIV Vaccine may raise risk” → reader: “I can still do that”
News claim: “Scientists have found that HIV vaccine has many side effects!” → reader: “Never!”
This affects trust in science and future actions.
Kuru et al. (2019); Gustafson and Rice (2019); Fischhoff (2012); Morton (2010). https://www.nature.com/articles/450325a
The science communication process
Scientists → Journalists → The Public
The public relies on journalists to learn scientific findings
The public perception of science is largely shaped by how journalists present science, rather than by the science itself.
… despite seeing substantial issues with how science is reported
The lack of domain-specific scientific knowledge makes it difficult to critically evaluate science news coverage.
Skewed reporting of science undermines trust in science
Hyped-up, polarised news articles (“caffeine causes cancer” / “coffee cures cancer”) lead to uncertainty and erosion of trust in scientists.
Schoenfeld and Ioannidis: “Is everything we eat associated with cancer? A systematic cookbook review.” American Journal of Clinical Nutrition, 2013. https://pubmed.ncbi.nlm.nih.gov/23193004/
https://www.vox.com/science-and-health/2019/6/11/18652225/hype-science-press-releases
It’s easy for the message to change
Paper (Fang et al., 2016): “Increasing dietary magnesium intake is associated with a reduced risk of stroke, heart failure, diabetes, and all-cause mortality.”
News (Reuters, 2016): “The study findings suggest that increased consumption of magnesium-rich foods may have health benefits.”
Twitter: “#Magnesium saves lives”
The message isn’t necessarily false, but it can be misleading and inaccurate and lead to behavior change.
Modelling Information Change – Automatic Fact Checking
The standard pipeline, illustrated on the claim “Magnesium saves lives”:
1. Claim Check-Worthiness Detection: “Magnesium saves lives” → check-worthy / not check-worthy
2. Evidence Document Retrieval and Ranking: retrieve and rank evidence for “Magnesium saves lives”
3. Recognising Textual Entailment: <“Magnesium saves lives”, “Increasing dietary magnesium intake is associated with a reduced risk of stroke, heart failure, diabetes, and all-cause mortality”> → positive / negative / neutral
4. Veracity Prediction: true / false / not enough info
A schematic sketch of this pipeline follows below.
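To make the four stages concrete, here is a schematic Python sketch. Every component is a toy stand-in (a length heuristic, word-overlap retrieval, a dummy entailment step), not the models used in real fact-checking systems.

```python
# Schematic sketch of the four-stage fact-checking pipeline above.
# All components are toy placeholders, not real models.

def is_check_worthy(claim: str) -> bool:
    # Stage 1: claim check-worthiness detection (toy length heuristic)
    return len(claim.split()) >= 3

def rank_evidence(claim: str, corpus: list[str], k: int = 3) -> list[str]:
    # Stage 2: evidence retrieval and ranking, here by naive word overlap
    words = set(claim.lower().split())
    return sorted(corpus,
                  key=lambda s: len(words & set(s.lower().split())),
                  reverse=True)[:k]

def stance(claim: str, evidence: str) -> str:
    # Stage 3: recognising textual entailment; a real system would run an
    # NLI model returning positive / negative / neutral
    return "neutral"

def verdict(stances: list[str]) -> str:
    # Stage 4: veracity prediction by aggregating per-evidence stances
    pos, neg = stances.count("positive"), stances.count("negative")
    if pos > neg:
        return "true"
    if neg > pos:
        return "false"
    return "not enough info"

claim = "Magnesium saves lives"
corpus = ["Increasing dietary magnesium intake is associated with a reduced "
          "risk of stroke, heart failure, diabetes, and all-cause mortality."]
if is_check_worthy(claim):
    print(verdict([stance(claim, e) for e in rank_evidence(claim, corpus)]))
```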
Evidence Ranking for Automatic Fact Checking
Evidence Document Retrieval and Ranking: <“Magnesium saves lives”, “The study findings suggest that increasing dietary magnesium intake is associated with a reduced risk of stroke, heart failure, diabetes, and all-cause mortality”>
● Notion of similarity matters
○ Strict textual similarity (most prior work)
○ Similarity of information content (proposed here)
● Domain differences increase task difficulty
○ Measure similarity between <claim, evidence> from <news, news> (most prior work)
○ Measure similarity between <claim, evidence> from <news, press release/twitter> (proposed here)
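As a concrete illustration of ranking claim–evidence pairs by similarity, here is a minimal sketch using Sentence-BERT; the all-MiniLM-L6-v2 checkpoint and the candidate sentences are illustrative choices, not necessarily those used in this work.

```python
# Minimal evidence-ranking sketch with Sentence-BERT
# (Reimers and Gurevych, 2019).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative checkpoint

claim = "Magnesium saves lives"
candidates = [
    "The study findings suggest that increasing dietary magnesium intake is "
    "associated with a reduced risk of stroke, heart failure, diabetes, and "
    "all-cause mortality.",
    "Magnesium is the twelfth element of the periodic table.",
]

claim_emb = model.encode(claim, convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)
scores = util.cos_sim(claim_emb, cand_embs)[0]  # cosine similarity per pair

# Rank candidate evidence sentences by similarity to the claim
for score, sent in sorted(zip(scores.tolist(), candidates), reverse=True):
    print(f"{score:.2f}  {sent[:60]}...")
```

Note that an off-the-shelf STS-style model like this measures strict textual similarity; the point of the work described next is to instead measure similarity of information content.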
Overview of Today’s Talk
● Introduction
○ The Life Cycle of Science Communication
● Part 1: Exaggeration Detection
○ Measuring differences in stated causal relationships
○ Experiments with health science press releases
● Part 2: Modelling Information Change
○ Modelling information change in communicating scientific findings more broadly
○ Experiments with press releases and tweets in different scientific domains
● Outlook and Conclusion
○ Future research challenges
Exaggeration Detection of Science Press Releases
Paper (Fang et al., 2016): “Increasing dietary magnesium intake is associated with a reduced risk of stroke, heart failure, diabetes, and all-cause mortality.”
News (Reuters, 2016): “The study findings suggest that increased consumption of magnesium-rich foods may have health benefits.”
Problem: the strength of the claim changes from a correlational statement in the paper (“associated with”) to conditionally causal in the news (“suggest”, “may”).
Exaggeration in Science Journalism
Sumner et al. (2014) and Bratton et al. (2019): InSciOut
Objective: to identify the source (press releases or news) of distortions, exaggerations, or changes to the main conclusions drawn from research that could potentially influence a reader’s health-related behaviour.
Conclusions:
• 33% of press releases contain exaggerations of the conclusions of scientific papers
• Exaggeration in news is strongly associated with exaggeration in press releases
Sumner, P., Vivian-Griffiths, S., Boivin, J., Williams, A., Venetis, C. A., Davies, A., ... & Chambers, C. D. (2014). The association between exaggeration in health related science news and academic press releases: retrospective observational study. BMJ, 349.
Bratton, L., Adams, R. C., Challenger, A., Boivin, J., Bott, L., Chambers, C. D., & Sumner, P. (2019). The association between exaggeration in health-related science news and academic press releases: a replication study. Wellcome Open Research, 4.
Modelling Information Change – Causal Claim Strength Prediction

Label  Type                Language Cues
0      No Relation         —
1      Correlational       association, associated with, predictor, at high risk of
2      Conditional causal  increase, decrease, lead to, effect on, contribute to, result in (cues indicating doubt: may, might, appear to, probably)
3      Direct causal       increase, decrease, lead to, effective on, contribute to, reduce, can

Li et al. “An NLP Analysis of Exaggerated Claims in Science News.” NLPmJ@EMNLP, 2017.
Yu et al. “Measuring Correlation-to-Causation Exaggeration in Press Releases.” COLING 2020.
Our Work on Exaggeration Detection in Science
● Formalize the task of scientific exaggeration detection: predicting when a press release exaggerates a scientific paper
● Curate a dataset from expert annotations to benchmark performance
○ Input: the primary finding of the paper as written in the abstract and the press release
● Investigate and develop methods for automatic scientific exaggeration detection
○ Semi-supervised method based on Pattern Exploiting Training (PET)
Wright et al. “Semi-Supervised Exaggeration Detection of Health Science Press Releases.” EMNLP 2021. https://aclanthology.org/2021.emnlp-main.845/
Task Formulations
(Claim strength labels and language cues as in the table above; Li et al., 2017)

Exaggeration detection (T1)
• Entailment-like task
• Paired (press release, abstract) data
• Label set ℒ_T1 = {0: Downplays, 1: Same, 2: Exaggerates}

Causal claim strength prediction (T2)
• Text classification task
• Unpaired press releases and abstracts
• Final prediction compares the strength of the paired press release and abstract
• Label set ℒ_T2 = {0: No Relation, 1: Correlational, 2: Conditional Causal, 3: Direct Causal}
Pattern Exploiting Training (Schick et al., 2020)
Traditional classifier C: “Eating chocolate causes happiness” → label distribution over {0, 1, 2, 3}, e.g. (0.01, 0.21, 0.15, 0.63)
PET: “Eating chocolate causes happiness. The claim strength is [MASK]” → masked language model ℳ → (0.01, 0.21, 0.15, 0.63)
● Pattern: transform the input to a cloze-style question
● Verbalizer: predict tokens from a large pretrained language model which reflect the data’s labels
● Training: one masked language model ℳ_0, ℳ_1, ℳ_2 is trained per pattern P_0, P_1, P_2; the ensemble soft-labels unlabelled data U, and a final classifier C is distilled from these soft labels using a KL-divergence loss.
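Below is a minimal sketch of the cloze reformulation at the heart of PET, using Hugging Face transformers. The pattern mirrors the slide’s example; the verbalizer words are borrowed from the claim-strength verbalisers shown later in this talk, and multi-token verbalizers are simplified to their first sub-token. Full PET additionally trains an ensemble over several patterns and distils a final classifier from soft labels, which this sketch omits.

```python
# Toy sketch of PET's cloze reformulation: recast classification as a
# masked-token prediction and read label scores off an MLM's logits.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base")
mlm.eval()

# Verbalizer: one indicative token per label (illustrative subset)
verbalizer = {"estimated": "correlational", "proven": "direct causal"}

def claim_strength(sentence: str) -> dict[str, float]:
    # Pattern: transform the input into a cloze-style question
    prompt = f"{sentence} The claim strength is {tok.mask_token}."
    enc = tok(prompt, return_tensors="pt")
    mask_pos = (enc.input_ids == tok.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = mlm(**enc).logits[0, mask_pos]
    # Score each label by the logit of its verbalizer token
    ids = [tok(" " + w, add_special_tokens=False).input_ids[0]
           for w in verbalizer]  # first sub-token, as a simplification
    probs = torch.softmax(logits[ids], dim=0)
    return dict(zip(verbalizer.values(), probs.tolist()))

print(claim_strength("Eating chocolate causes happiness."))
```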
MT-PET
Main task pattern P_m (exaggeration detection): “Scientists claim eating chocolate sometimes causes happiness. Reporters claim eating chocolate causes happiness. The reporters claims are [MASK]” → (0.01, 0.05, 0.94)
Auxiliary task pattern P_a (claim strength): “Eating chocolate causes happiness. The claim strength is [MASK]” → (0.01, 0.21, 0.15, 0.63)
Training mirrors PET, but complementary pattern-verbalizer pairs for the main task (P_m^0, P_m^1) and the auxiliary task (P_a^0, P_a^1) are trained jointly on their respective data D_m and D_a; the resulting models ℳ_0, ℳ_1 soft-label the unlabelled main-task data U_m, from which the final classifier C is distilled with a KL-divergence loss.
MT-PET for Exaggeration Detection

Name     Pattern
P_T1^0   Scientists claim s. Reporters claim t. The reporters claims are [MASK]
P_T2^0   [Scientists|Reporters] say [s|t]. The claim strength is [MASK]
P_T1^1   Academic literature claims s. Popular media claims t. The media claims are [MASK]
P_T2^1   [Academic literature|Popular media] says [s|t]. The claim strength is [MASK]

Our tasks are T1 (exaggeration prediction) and T2 (claim strength prediction).
We develop patterns by hand and verbalizers semi-automatically using PETAL (Schick et al., 2020).
s and t are the claim text in the abstract and press release, respectively.
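As a small illustration of how these templates instantiate, here are the two pattern types as hypothetical Python functions (s and t as defined above):

```python
# Hypothetical template functions for the MT-PET patterns in the table.
def pattern_t1_0(s: str, t: str) -> str:
    # T1 (exaggeration prediction): paired input, one masked slot
    return (f"Scientists claim {s} Reporters claim {t} "
            "The reporters claims are [MASK]")

def pattern_t2_0(source: str, x: str) -> str:
    # T2 (claim strength prediction): one source at a time
    return f"{source} say {x} The claim strength is [MASK]"

print(pattern_t2_0("Reporters", "eating chocolate causes happiness."))
```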
Exaggeration Detection Verbalisers

Pattern  Label          Verbalizers
P_T1^0   Downplays      preliminary, competing, uncertainties
         Same           following, explicit
         Exaggerates    mistaken, wrong, hollow, naive, false, lies
P_T1^1   Downplays      hypothetical, theoretical, conditional
         Same           identical
         Exaggerates    mistaken, wrong, premature, fantasy, noisy, artificial
P_T2^*   No Relation    sufficient, enough, authentic, medium
         Correlational  inferred, estimated, calculated, borderline, approximately, variable, roughly
         Cond. Causal   cautious, premature, uncertain, conflicting, limited
         Causal         touted, proven, replicated, promoted, distorted
T1 (Exaggeration Detection) with MT-PET

Method      P      R      F1
Supervised  28.06  33.10  29.05
PET         41.90  39.87  39.12
MT-PET      47.80  47.99  47.35

Substantial improvements when using PET (10 points).
Further improvements with MT-PET (8 points).
Demonstrates transfer of knowledge from claim strength prediction to exaggeration prediction.
Learning Dynamics for T2 (Claim Strength Prediction)
● MT-PET with 200 samples approaches the performance of vanilla PET with 500 samples
● MT-PET with 200 samples approaches the performance of supervised learning with 4,500 samples
● PET always outperforms supervised learning
Overview of Today’s Talk
● Introduction
○ The Life Cycle of Science Communication
● Part 1: Exaggeration Detection
○ Measuring differences in stated causal relationships
○ Experiments with health science press releases
● Part 2: Modelling Information Change
○ Modelling information change in communicating scientific findings more broadly
○ Experiments with press releases and tweets in different scientific domains
Modelling Information Change in Scientific Communication
Paper (Fang et al., 2016): “Increasing dietary magnesium intake is associated with a reduced risk of stroke, heart failure, diabetes, and all-cause mortality.”
News (Reuters, 2016): “The study findings suggest that increased consumption of magnesium-rich foods may have health benefits.”
Twitter: “#Magnesium saves lives”
Problem: the message isn’t necessarily false, but it can be misleading and inaccurate and lead to behavior change.
Proposal: General Model of Information Change for SciComm
“The study findings suggest that increased consumption of magnesium-rich foods may have health benefits.” ↔ “Increasing dietary magnesium intake is associated with a reduced risk of stroke, heart failure, diabetes, and all-cause mortality.” → score 4.09
“In California, drone delivery of a small package would result in about 0.42 kg of greenhouse gas emissions.” ↔ (the same paper finding) → score 1.14
Wright et al. “Modeling Information Change in Science Communication with Semantically Matched Paraphrases.” EMNLP 2022. https://aclanthology.org/2022.emnlp-main.117/
Information Matching Score (IMS)
A 5-point measure of the similarity of the scientific findings described by two sentences: 1 = completely different, 5 = completely the same. For matched findings, scores between 2 and 4 grade the degree of change, from substantial change up to no change.
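The evaluation protocol described in the talk notes converts a model’s cosine similarity to this scale by clipping negative values to 0 and rescaling linearly to [1, 5]; a one-line sketch of that mapping, as an assumption based on those notes:

```python
# Hedged sketch: map a cosine similarity onto the 1-5 IMS scale by
# clipping negatives to 0 and rescaling linearly (per the talk notes).
def cosine_to_ims(cos_sim: float) -> float:
    return 1.0 + 4.0 * max(cos_sim, 0.0)

assert cosine_to_ims(1.0) == 5.0   # completely the same
assert cosine_to_ims(-0.2) == 1.0  # completely different
```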
Data
News + paper processing:
● Abstract parser: RoBERTa fine-tuned on PubMed abstracts (F1 > 0.9), labelling sentences as Background, Objective, Methods, Results, or Conclusion
● 17,668 / 41,388 / 733,755 source documents, yielding 45.7M potential <news, paper> pairs and 35.6M potential <tweet, paper> pairs
● Sentence-BERT (Reimers and Gurevych, 2019) scores each pair in [0, 1]
● Bucketed sample: 2,400 <news, paper> pairs and 1,200 <tweet, paper> pairs selected for annotation (see the sampling sketch below)
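A hedged sketch of the bucketed sampling step referenced above, assuming 0.05-wide similarity buckets sampled evenly; the data structures and parameters are illustrative:

```python
# Sample pairs evenly across SBERT-similarity buckets so that annotation
# covers the whole similarity range, not just the most similar pairs.
import random

def bucketed_sample(scored_pairs, per_bucket=10, width=0.05, seed=0):
    # scored_pairs: iterable of (pair, sbert_similarity in [0, 1])
    buckets: dict[int, list] = {}
    for pair, sim in scored_pairs:
        buckets.setdefault(int(min(sim, 0.9999) / width), []).append(pair)
    rng = random.Random(seed)
    sample = []
    for bucket in buckets.values():
        sample.extend(rng.sample(bucket, min(per_bucket, len(bucket))))
    return sample
```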
Annotation
● Domains: Computer Science, Medicine, Biology, Psychology
● 2,400 <news, paper> pairs and 1,200 <tweet, paper> pairs for annotation
● 🥔 POTATO annotation UI
● Domain expert annotators
Semantic Paraphrase and Information Change Dataset (SPICED)
● Domains: Computer Science, Medicine, Biology, Psychology
● 2,400 <news, paper> pairs and 1,200 <tweet, paper> pairs, expert-annotated
● Supplemented with 1,200 <news, paper> and 1,200 <tweet, paper> easy matched and unmatched pairs, selected based on similarity
● Total: 3,600 annotated pairs and 2,400 easy pairs
SPICED vs. Semantic Textual Similarity
News: “Beckley, who is in the department of psychology and neuroscience at Duke, said that the adult-onset group had a history of anti-social behavior back to childhood, but reported committing relatively fewer crimes.”
Paper: “Our results showed that most of the adult onset men began their antisocial activities during early childhood.”
STS: 0.38 (max 1) vs. 🌶 SPICED: 4.4 (max 5)
SPICED vs. other sentence matching tasks
Average normalised edit distance across matching sentence pairs $(s, t)$ in the training set: $\frac{1}{|D|} \sum_{(s,t) \in D} \frac{d(s,t)}{\max(|s|, |t|)}$, where $d$ is the edit distance. Matching pairs in SPICED exhibit much greater lexical change than in STS or NLI.
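For completeness, a small self-contained implementation of the normalised edit distance assumed above (Levenshtein distance normalised by the longer string’s length):

```python
def levenshtein(s: str, t: str) -> int:
    # Classic dynamic-programming edit distance over characters
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]

def normalised_edit_distance(s: str, t: str) -> float:
    return levenshtein(s, t) / max(len(s), len(t), 1)

print(normalised_edit_distance("magnesium saves lives",
                               "dietary magnesium reduces mortality"))
```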
Benchmarking
⛄ Zero-Shot Transfer: Paraphrase Detection, NLI, MiniLM, MPNet
🌶 Fine-Tuning on SPICED: RoBERTa, SciBERT, CiteBERT, MiniLM-FT, MPNet-FT
Benchmarking: ⛄ Zero-Shot Transfer
● Paraphrase Detection: RoBERTa fine-tuned on adversarial paraphrases (Nighojkar and Licato, 2021)
● NLI: RoBERTa fine-tuned on SNLI, MNLI, FEVER, and ANLI
● MiniLM: Sentence-BERT based on MiniLM (Wang et al., 2020)
● MPNet: SBERT based on MPNet (Song et al., 2020)
Both SBERT models are pre-trained on a corpus of >1B sentence pairs using contrastive learning.
Benchmarking: 🌶 Fine-Tuning on SPICED
● RoBERTa, SciBERT, CiteBERT: BERT-style models pretrained on general-domain text (RoBERTa) and scientific text (SciBERT, CiteBERT), fine-tuned on SPICED by minimizing the mean squared error between the model’s prediction and the ground-truth IMS
● MiniLM-FT, MPNet-FT: SBERT models fine-tuned on SPICED by minimizing the cosine distance between the model’s prediction and the ground-truth IMS
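A minimal sketch of the SBERT fine-tuning setup, using the sentence-transformers library; the base checkpoint, the toy data, and the rescaling of the 1–5 IMS into the [0, 1] range expected by CosineSimilarityLoss are all illustrative assumptions:

```python
# Fine-tune an SBERT model so that the cosine similarity of the two
# finding embeddings matches the (rescaled) ground-truth IMS.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative checkpoint

spiced_rows = [  # (paper finding, reported finding, IMS) - toy examples
    ("Dietary magnesium intake is associated with reduced stroke risk.",
     "#Magnesium saves lives", 2.0),
    ("Dietary magnesium intake is associated with reduced stroke risk.",
     "Magnesium-rich diets are linked to lower stroke risk.", 5.0),
]
train = [InputExample(texts=[s, t], label=(ims - 1.0) / 4.0)  # IMS -> [0, 1]
         for s, t, ims in spiced_rows]

loader = DataLoader(train, batch_size=16, shuffle=True)
loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```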
Results (Pearson correlation, reported overall and separately for news and tweets)
● Paraphrase/NLI models perform poorly
● Best overall is SBERT + fine-tuning
● Tweets are harder than news
● Potentially much room for improvement (STS tasks see scores in the 90s)
Zero-shot scientific evidence retrieval
Does training on 🌶 SPICED improve performance on scientific evidence retrieval for real-world claims?
● CoVERT: 300 claims from Twitter matched with 717 evidence sentences from news articles
● COVID-Fact: 4,086 claims from Reddit matched with 3,219 unique evidence sentences from news articles
Training on SPICED, testing on CoVERT and COVID-Fact.
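For reference, a small sketch of one of the reported metrics: mean reciprocal rank over per-claim ranked evidence lists (MAP is computed analogously over all gold sentences). The inputs here are toy placeholders.

```python
def mean_reciprocal_rank(ranked: list[list[str]],
                         gold: list[set[str]]) -> float:
    # ranked[i]: evidence sentences for claim i, ordered by model score
    # gold[i]:   the gold evidence sentences for claim i
    total = 0.0
    for docs, gold_set in zip(ranked, gold):
        rank = next((r for r, d in enumerate(docs, 1) if d in gold_set), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(ranked)

print(mean_reciprocal_rank([["e2", "e1"]], [{"e1"}]))  # 0.5
```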
Results (MAP / MRR; ± values are standard deviations across runs)

                 CoVERT                    COVID-Fact
Method     MAP          MRR          MAP          MRR
BM25       12.45±0.00   20.78±0.00   35.18±0.00   52.98±0.00
MiniLM     26.84±0.00   37.98±0.00   50.11±0.00   64.78±0.00
 + FT      28.23±0.08   40.81±0.16   52.66±0.10   66.91±0.09
MPNet      25.21±0.00   35.54±0.00   52.39±0.00   66.21±0.00
 + FT      26.84±0.19   37.65±0.32   53.61±0.33   67.46±0.28

On both datasets, the best performance is achieved by an SBERT model fine-tuned on SPICED, and fine-tuning yields gains for both SBERT variants.
RQ1: Do findings reported by different types of outlets express different degrees of information change from their respective papers?
(This question targets the journalist stage of the communication process.)
RQ1: Do findings reported by different types of outlets express different degrees of information change from their respective papers?
Linear mixed-effects regression model over 1.1M matched <news, paper> pairs (a sketch follows below):
● IV: outlet type – Press Release, Sci&Tech, General outlet
● DV: information matching score
● Controls – fixed effect: subjects; random effect: paper
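A hedged sketch of this kind of analysis using statsmodels on synthetic data; the column names and toy data generation are illustrative, not the study’s actual variables:

```python
# Linear mixed-effects model: fixed effects for outlet type and subject,
# random intercept per paper, fitted on synthetic IMS scores.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "outlet": rng.choice(["press_release", "sci_tech", "general"], n),
    "subject": rng.choice(["medicine", "biology", "psychology", "cs"], n),
    "paper": rng.choice([f"p{i}" for i in range(40)], n),
})
df["ims"] = 3.5 + rng.normal(0.0, 0.5, n)  # synthetic scores

model = smf.mixedlm("ims ~ C(outlet) + C(subject)", df, groups=df["paper"])
print(model.fit().summary())
```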
RQ1: Do findings reported by different types of outlets express different degrees of information change from their respective papers?
YES: scientific findings covered by press releases and sci/tech outlets generally show less information change than findings presented in general outlets, consistent with audience design in journalism (Roland, 2009).
RQ2: Do different types of social media users systematically vary in information change when discussing scientific findings?
Scientists → Journalists → The Public
RQ2: Do different types of social media users systematically vary in information change when discussing scientific findings?
Linear mixed-effects regression model over 182K matched <tweet, paper> pairs:
● IVs: organizational account, account age, following, followers, verified
● DV: information matching score
● Controls – fixed effect: subjects; random effect: paper
RQ2: Do different types of social media users systematically vary in information change when discussing scientific findings?
YES
● Organizational Twitter accounts keep more of the original information from the paper’s findings
● Verified accounts and accounts with more followers change information more
RQ3: Which parts of a paper are more likely to be miscommunicated by the media?
Scientists → Journalists: the same finding can be a good translation, an overstatement, or an exaggeration, depending on the paper section it is drawn from (Abstract, Introduction, Results, ...).
Measures, computed over 1.1M matched <news, paper> pairs:
● Certainty (Pei and Jurgens, 2021): e.g. “HIV Vaccine may raise the risk of certain diseases” reported as “Scientists have found that HIV vaccine has many side effects!”
● Exaggeration (Wright and Augenstein, 2021): e.g. “Our new NLP model performs better than several human baselines” reported as “AI is conquering the world!”
Findings:
● Journalists tend to downplay the certainty and strength of findings in abstracts (Pei and Jurgens, 2021)
● Compared with findings presented in other sections, and especially in the limitations, news findings are more likely to be exaggerated and overstated
● Journalists might fail to report the limitations of scientific findings (Fischhoff, 2012)
● Only studying abstracts is not enough!
Overview of Today’s Talk
● Introduction
○ The Life Cycle of Science Communication
● Part 1: Exaggeration Detection
○ Measuring differences in stated causal relationships
○ Experiments with health science press releases
● Part 2: Modelling Information Change
○ Modelling information change in communicating scientific findings more broadly
○ Experiments with press releases and tweets in different scientific domains
● Outlook and Conclusion
○ Future research challenges
Major Takeaways
● Careful science communication is important
○ The general public relies on general news outlets for science news
○ Overhyping of science news erodes trust
○ Exaggeration of findings can lead to behaviour change (e.g. “#Magnesium saves lives” on Twitter)
Major Takeaways
● Proposal: general model of information change
○ Prior work: focus on semantic textual similarity
○ Example – News: “Beckley, who is in the department of psychology and neuroscience at Duke, said that the adult-onset group had a history of anti-social behavior back to childhood, but reported committing relatively fewer crimes.” vs. Paper: “Our results showed that most of the adult onset men began their antisocial activities during early childhood.” – STS: 0.38 (max 1); 🌶 SPICED: 4.4 (max 5)
Major Takeaways
● New task definition, datasets and benchmarking for modelling information change in science communication
○ Diverse benchmark consisting of data from four scientific domains and three textual domains (publications, press releases, tweets)
○ Poor zero-shot performance of related tasks (paraphrasing, natural language inference) demonstrates the novelty of the task
○ Downstream improvements for scientific fact checking highlight the task’s importance
Model: copenlu/spiced · Dataset: copenlu/spiced · Code: copenlu/scientific-information-change · PyPI package: pip install scientific-information-change
Major Takeaways
● Opens the door to asking new research questions about broad trends in science communication
○ E.g. RQ1: Do findings reported by different types of outlets express different degrees of information change from their respective papers? YES: findings covered by press releases and sci/tech outlets generally show less information change than findings presented in general outlets (audience design in journalism; Roland, 2009)
Future Work
● Information change prediction as auxiliary task for other downstream
scientific NLP tasks
○ E.g. Measuring selective reporting of findings in related work descriptions,
generating faithful summaries of scientific articles
● There is selective reporting of science news – what factors affect
journalists’ selection of scientific findings?
○ E.g. societal relevance, economic implications, entertainment value
● Information is changed in different ways throughout the science communication process – which types of change exist and are prevalent?
○ Taxonomy of information change needed
Acknowledgements
CopeNLU
https://copenlu.github.io/
Jiaxin Pei David Jurgens
Dustin Wright
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie
Skłodowska-Curie grant agreement No 801199.
References
Dustin Wright, Isabelle Augenstein. Semi-Supervised Exaggeration Detection of Health Science Press Releases. EMNLP 2021.
Paper: https://aclanthology.org/2021.emnlp-main.845/
Code, data and models: https://github.com/copenlu/scientific-exaggeration-detection
Dustin Wright, Jiaxin Pei, David Jurgens, Isabelle Augenstein. Modeling Information Change in
Science Communication with Semantically Matched Paraphrases. EMNLP 2022.
Paper: https://aclanthology.org/2022.emnlp-main.117/
Code, data and models: http://www.copenlu.com/publication/2022_emnlp_wright/
Open positions
1 PhD student, 1 postdoc – explainable fact checking
funded by an ERC starting grant
application deadline: 1 March 2023
start date: Autumn 2023
PhD: https://jobportal.ku.dk/phd/?show=158207
Postdoc: https://jobportal.ku.dk/videnskabelige-stillinger/?show=158206
1 PhD student – fair and accountable NLP
funded by Carlsberg Semper Ardens project
application deadline: 28 February 2023
start date: Autumn 2023
PhD: https://employment.ku.dk/all-vacancies/?show=158390
Thank you! Questions?
Paper (Fang et al., 2016): “Increasing dietary magnesium intake is associated with a reduced risk of stroke, heart failure, diabetes, and all-cause mortality.”
News (Reuters, 2016): “The study findings suggest that increased consumption of magnesium-rich foods may have health benefits.”
Twitter: “#Magnesium saves lives”
Editor's Notes

  • #3 There’s strong scientific evidence that how science is communicated matters. For example, how a particular vaccine is framed in the media has an impact on vaccine uptake.
  • #4 This communication is done through process of translation from complex scientific papers written by scientists to accessible news stories written by journalists and finally to public discussions among peer groups.
  • #5 As a result, the public perception of and trust in science is largely shaped by how journalists present science instead of the science itself.
  • #6 The public consumes science via general news media, even though they see substantial issues with how science is reported. Additionally, the lack of specialised domain knowledge makes it difficult to critically evaluate science news coverage.
  • #7 Meta-analysis of common cookbook ingredients, extracted relative risk of cancer
  • #8 At the same time, it's very easy for the message of science to change, as demonstrated by this real-world example. Multiple aspects of the scientific finding described in Fang et al. are altered here, including:
  • #9 The independent variable being generalized from dietary magnesium to all magnesium
  • #10 The strength of the claim changing from a correlational statement to conditionally causal in the news and causal on Twitter
  • #11 And the diseases in the finding being generalized
  • #12 As a result, the translated claim isn’t necessarily false, but it can potentially be misleading or inaccurate and lead to behavior change. Two options: 1) aim at creating better news coverage – better communication with journalists, uphill battle but we can do our bit as scientists; 2) build tools and resources to automatically understand and analyze these changes in information in the context of science communication, in order to broadly understand and improve the science communication process.
  • #13 Now there is a relationship between modelling information change in scientific communication and automatic fact checking. As just said, a difference in veracity and information change in science communication are not exactly the same. Just because there is a difference in information content does not mean there has to be a difference in veracity. To illustrate this, let’s look at how automatic fact checking generally works.
  • #14 Let's have a closer look at the evidence ranking step. Prior work defines similarity very strictly, looking at the textual similarity, where all words in one sentence have to overlap with all words in the other sentence for them to be ranked highly. We are only concerned with the information contained in the findings and ignore all extraneous information such as “The study findings suggest...”, whereas trained STS models take that into account.
  • #17 Note: press releases are high-quality, from the universities’ websites, and EurekAlert.org
  • #20 T2 can be converted to T1 at decoding / testing time
  • #21 - The two primary components of PET are patterns and verbalizers. Patterns are cloze-style sentences which mask a single token. - Verbalizers are single tokens which capture the meaning of the task's labels in natural language, and which the model should predict to fill in the masked slots in the provided patterns. - Given a set of pattern-verbalizer pairs (PVPs), an ensemble of models is trained on a small labeled seed dataset to predict the appropriate verbalizations of the labels in the masked slots. - These models are then applied on unlabeled data, and the raw logits are combined as a weighted average to provide soft labels for the unlabeled data. - A final classifier is then trained on the soft-labeled data using a distillation loss based on KL-divergence. - M are the masked LMs resulting from using the different patterns, applied to U (the set of unlabelled data) to get soft probability distributions; the final model is a distilled version of this ensemble of masked LMs, trained using a KL-divergence loss between the predictions of the PET model and the target logits. Distillation is part of training the final classifier; the original one-hot vector is not used.
  • #22 T1 is exaggeration prediction and T2 is claim strength prediction. The main task uses P_m; P_a is the complementary task. At training time, P_a is applied to both popular science communication and scientific articles; at test time, two independent predictions can be used to infer the final label. U_m is the unlabelled main-task data.
  • #24 Logits of multiple verbalisers are averaged for the prediction of that class
  • #28 As a result, the translated claim isn’t necessarily false, but it can potentially be misleading or inaccurate and lead to behavior change. The goal of this work is to build tools and resources to automatically understand and analyze these changes in information in the context of science communication, in order to broadly understand and improve the science communication process.
  • #29 At a high level, the task is to measure the similarity of the scientific findings described by two scientific sentences. Here, a scientific finding is defined as “a statement that describes a particular research output of a scientific study, which could be a result, conclusion, product, or other research output.” We wish to build models which predict a scalar value for this information similarity, in order to determine which findings are matching and the degree to which the information in matched findings changes.
  • #30 This led us to define the information matching score, or IMS, which is a 5-point measure of the similarity of the scientific findings described by two sentences.
  • #31 We start by matching scientific papers with news articles and Tweets using Altmetric, an aggregator of mentions of scientific articles online. From this pool of data, we extract potential finding sentences from the scientific papers and news articles automatically and take Tweets as is, pairing all extracted sentences between papers and news, and papers and tweets. This yields an unlabelled pool of 45.7M potential (news, paper) pairs and 35.6M (tweet, paper) pairs. To limit our set of data for annotation, we pre-select potential matches using a SentenceBERT model trained on over 1B sentence pairs. To get a range of potentially highly similar and highly dissimilar pairs, we do a bucketed sample based on the similarity predicted by SBERT, sampling evenly from the unlabelled data in 0.05-increment buckets. We thus select a final set of 2,400 (news, paper) pairs and 1,200 (tweet, paper) pairs for annotation, distributed evenly between four scientific fields: medicine, biology, psychology, and computer science.
  • #32 To annotate this data, we use the POTATO annotation interface developed by David's lab at the University of Michigan, and recruit domain experts via the Prolific platform.
  • #33 Finally, after acquiring 5 expert annotations for each pair and filtering for annotator competence using MACE, we supplement the training split of the data with 1,200 very highly similar and very dissimilar pairs as predicted by SBERT, yielding a final dataset of 3,600 annotated pairs and 2,400 supplemented easy pairs. We call the dataset "SPICED".
  • #34 To contrast this task with STS, we can see stark differences in the prediction of SBERT for examples in SPICED which have very high similarity.
  • #35 Additionally, we see much greater lexical changes between matching pairs in SPICED and other tasks such as STS and NLI
  • #36 We look at multiple settings and models, including Zero-shot transfer to SPICED as well as Fine-tuning on SPICED
  • #37 For zero-shot transfer, we experiment with models pretrained on paraphrase detection and natural language inference, as well as SBERT models pretrained on over 1B sentences, including scientific text. MiniLM is a BERT model trained by distilling multiple BERTs into one; MPNet is a language model trained using permuted language modeling. SBERT uses siamese BERT networks to obtain sentence embeddings for pairs of sentences, trained to minimize the distance between these two embeddings.
  • #38 For fine-tuning, we look at both vanilla BERT family models for both general and scientific text as well as the same SBERT models in the zero-shot transfer setting
  • #39 In terms of Pearson correlation on a held-out test set, we see that SBERT does well for zero-shot transfer; the best performance is achieved by SBERT fine-tuned on SPICED; tweets are much harder than news; and there is potentially much room for improvement, considering how well the same models perform on other similarity tasks.
  • #40 The first application I'll talk about is zero-shot evidence retrieval for real-world scientific claims. Here, the setting is SBERT both with and without fine-tuning on SPICED and with no fine-tuning on the downstream datasets. We look at two datasets: CoVERT, which is sourced from Twitter with evidence from the news, and COVID-Fact, which is sourced from Reddit, also with evidence from the news. CoVERT [163] is a dataset of scientific claims sourced from Twitter, mostly in the domain of biomedicine. We use the 300 claims and the 717 unique evidence sentences in the corpus in our experiment. COVID-Fact [206] is a semi-automatically curated dataset of claims related to COVID-19 sourced from Reddit. The corpus contains 4,086 claims with 3,219 unique evidence sentences.
  • #41 Starting with CoVERT, we see that in terms of mean average precision and mean reciprocal rank for retrieving ground-truth evidence, the best performing model by a large margin is an SBERT model fine-tuned on SPICED. MiniLM: SBERT with MiniLM (a BERT model trained by distilling multiple BERTs into one) as the base network [244]; SBERT uses siamese BERT networks to obtain sentence embeddings for pairs of sentences, trained to minimize the distance between these two embeddings. We obtain sentence embeddings for pairs of findings and measure the cosine similarity between these two embeddings, clip the lowest score to 0, and convert this score to the range [1,5]. Note that this model was trained on over 1B sentence pairs, including from scientific text, using a contrastive learning approach where the embeddings of sentences known to be similar are trained to be closer than the embeddings of negatively sampled sentences. SBERT models represent a very strong baseline on this task, and have been used in the context of other matching tasks for fact checking, including detecting previously fact-checked claims [216]. MPNet: the same setting and training data as MiniLM but with MPNet as the base network [220]. MiniLM-FT: the same MiniLM model from the zero-shot transfer setup but further fine-tuned on SPICED; the training objective is to minimize the distance between the IMS and the cosine similarity of the output embeddings of the pair of findings.
  • #42 Additionally, we see gains for both SBERT models when fine-tuned
  • #43 We see a similar story on Covid-Fact, where the best performing model involves fine-tuning
  • #44 And fine-tuning yields improvements for both SBERT variants. This is encouraging, as it shows that SPICED enables transfer for two models on two different datasets which contain pairs and domains that don’t exist in SPICED, namely (Twitter, news) and (Reddit, news) pairs.
  • #45 Finally, I’ll describe large scale trends in science communication that our models have allowed us to reveal First, we ask if the type of news outlet has an effect on information change.
  • #46 For this, we build a linear mixed effect model to predict the IMS of 1.1M <news, paper> pairs which have been matched by our best performing SBERT model. We consider a pair to be matching if the predicted IMS is above 3. We include fixed effects for type of news outlet (here “Press Release”, “Science and Technology”, and “General News”) as well as the paper subject, and a random effect for each paper with >30 pairs.
  • #47 We find that the answer to this question is Yes Looking at the fixed effects, we find that Scientific findings covered by Press Release and SciTech generally have fewer informational changes compared with findings presented in General Outlets
  • #48 Next, we ask if different social media users systematically vary in information change when discussing scientific findings
  • #49 For this, we build a linear mixed effect model over 182 thousand matched (tweet, paper) pairs using our best model, and include fixed effects for different social factors
  • #50 Here we again find Yes Looking at the fixed effects, we find Organisational accounts tend to be more faithful
  • #51 Additionally, verified accounts and accounts with more followers tend to change information more.
  • #52 Finally, we ask which parts of a paper are more likely to be miscommunicated by the media
  • #53 Here, we further analyse the 1.1M matched findings from our first research question by classifying their degree of certainty and causal claim strength using models from both of our labs’ previous work
  • #54 We find that journalists tend to downplay the certainty and strength of findings in the abstracts, mirroring previous findings
  • #55 And that findings as presented in the limitations are more likely to be exaggerated and overstated
  • #56 This could be explained by known problems in the reporting of limitations by science journalists
  • #57 Ultimately, our finding also reveals that studying abstracts alone is not enough in the context of science communication
  • #67 Final message: the way science is communicated affects behaviour. Be careful in your communication with journalists, and be careful what you write on Twitter