Most work on scholarly document processing assumes that the information processed is trustworthy and factually correct. However, this is not always the case. There are two core challenges that should be addressed: 1) ensuring that scientific publications are credible -- e.g. that claims are not made without supporting evidence, and that all relevant supporting evidence is provided; and 2) ensuring that scientific findings are not misrepresented, distorted or outright misreported when communicated by journalists or the general public. I will present some first steps towards addressing these problems and outline remaining challenges.
SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Scientific Publications – Isabelle Augenstein
Shared task summary for SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Scientific Publications
Paper: https://arxiv.org/abs/1704.02853
Abstract:
We describe the SemEval task of extracting keyphrases and relations between them from scientific documents, which is crucial for understanding which publications describe which processes, tasks and materials. Although this was a new task, we had a total of 26 submissions across 3 evaluation scenarios. We expect the task and the findings reported in this paper to be relevant for researchers working on understanding scientific content, as well as the broader knowledge base population and information extraction communities.
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Summit) – Isabelle Augenstein
The document discusses machine reading using neural machines. It presents the goals of fact checking claims and understanding scientific publications. It outlines challenges in tasks like stance detection on tweets and summarizing scientific papers, including interpreting statements relative to the target or headline, handling unseen targets, the small size of benchmark datasets, and the computational cost of neural machine reading.
Semantic annotation is performed by first representing words and documents in the vector space model using Word2Vec and Doc2Vec implementations; the vectors are then used as features to train a classifier that can label a document with ACM classification tree categories, with the help of a Wikipedia corpus (a toy sketch follows the references below).
Project Presentation: https://youtu.be/706HJteh1xc
Project Webpage: http://rohitsakala.github.io/semanticAnnotationAcmCategories/
Source Code: https://github.com/rohitsakala/semanticAnnotationAcmCategories
References:
Quoc V. Le and Tomas Mikolov, "Distributed Representations of Sentences and Documents", ICML 2014.
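As a rough illustration of the pipeline this entry describes, here is a minimal sketch using gensim's Doc2Vec and a scikit-learn classifier. The corpus, labels, and classifier choice are invented for illustration and are not taken from the project.

```python
# Minimal sketch: Doc2Vec document vectors as features for a category
# classifier (gensim 4.x API; toy data stands in for the Wikipedia corpus).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

docs = [("neural networks learn representations", "I.2"),
        ("sorting algorithms and data structures", "E.1"),
        ("database query optimisation techniques", "H.2")]

tagged = [TaggedDocument(words=text.split(), tags=[str(i)])
          for i, (text, _) in enumerate(docs)]
model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

X = [model.dv[str(i)] for i in range(len(docs))]  # learned document vectors
y = [label for _, label in docs]                  # ACM-style category labels

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([model.infer_vector("graph databases and queries".split())]))
```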
Transfer Learning -- The Next Frontier for Machine Learning – Sebastian Ruder
Sebastian Ruder gave a presentation on transfer learning in machine learning. He began by defining transfer learning as applying knowledge gained from solving one problem to a different but related problem. Transfer learning is now important because machine learning models have matured and are being widely deployed, but often lack labeled data for new tasks or domains. Ruder discussed examples of transfer learning in computer vision and natural language processing. He described his research focus on finding better ways to transfer knowledge between domains, tasks, and languages in large-scale, real-world applications.
This document presents a method for clustering citation distributions of authors to categorize them semantically and predict future citations. It uses hierarchical clustering with normalized Euclidean distance on citation distributions. Clusters are evaluated based on homogeneity of citation patterns over time. Semantic features of author bibliometric data are represented using the BiDO ontology to link numeric and categorical data over time. The method was evaluated on a dataset of 20,000 computer scientists from 1990-2010. Future work involves augmenting features, applying it to groups, extending the ontology, and creating a linked bibliometric triplestore.
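A minimal sketch of this style of clustering with SciPy follows; the toy citation counts, normalisation, and linkage settings are assumptions, not the paper's exact configuration.

```python
# Hierarchical clustering of per-year citation distributions with
# normalised Euclidean distances (toy data).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

citations = np.array([
    [ 1,  3,  8, 15, 20],   # author with a rising citation profile
    [ 2,  5, 10, 18, 25],   # another rising profile
    [30, 20, 10,  5,  2],   # declining profile
])

# Normalise each row so the shape of the distribution, not its volume,
# drives the distance computation.
norm = citations / citations.sum(axis=1, keepdims=True)

Z = linkage(norm, method="average", metric="euclidean")
print(fcluster(Z, t=2, criterion="maxclust"))  # rising authors cluster together
```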
The document is a thesis proposal by Justin Sybrandt at Clemson University that outlines his past and proposed work on exploiting latent features in text and graphs. It summarizes Sybrandt's peer-reviewed work using embeddings to generate biomedical hypotheses from text and validate hypotheses through ranking. It also discusses pending work on heterogeneous bipartite graph embeddings and partitioned hypergraphs. The proposal provides background on Sybrandt's hypothesis generation work and outlines his proposed future research directions involving graph embeddings.
IRJET-Classifying Mined Online Discussion Data for Reflective Thinking based on Ontology – IRJET Journal
This document presents a methodology for classifying mined online discussion data to identify reflective thinking based on ontology. It involves the following steps (a toy sketch of the preprocessing and Naive Bayes steps follows the list):
1. Collecting online discussion data and preprocessing it by removing stop words and punctuation.
2. Implementing inductive content analysis to categorize the data into six types of reflective thinking.
3. Training a Naive Bayes classifier on the categorized data to classify new data.
4. Applying the trained model to large scale unlabeled online discussion data.
5. Using ontology to provide a deeper classification of topics in the data beyond the six reflective thinking categories. This allows extraction of additional knowledge from the classified text data.
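A toy sketch of the preprocessing and Naive Bayes steps with scikit-learn; the category names and example posts are invented, not the paper's data.

```python
# Steps 1 and 3 in miniature: stop-word removal via CountVectorizer,
# then a Naive Bayes classifier over the categorised posts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

posts = ["I wonder whether my reasoning here was sound",
         "the deadline for the assignment is Friday",
         "looking back, I would approach the task differently"]
labels = ["self-questioning", "non-reflective", "re-evaluation"]  # invented names

clf = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
clf.fit(posts, labels)
print(clf.predict(["next time I will plan my discussion posts earlier"]))
```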
Transfer learning aims to improve learning outcomes for a target task by leveraging knowledge from a related source task. It does this by influencing the target task's assumptions based on what was learned from the source task. This can allow for faster and better generalized learning in the target task. However, there is a risk of negative transfer where performance decreases. To avoid this, methods examine task similarity and reject harmful source knowledge, or generate multiple mappings between source and target to identify the best match. The goal of transfer learning is to start higher, learn faster, and achieve better overall performance compared to learning the target task without transfer.
Semantics2018 Zhang, Petrak, Maynard: Adapted TextRank for Term Extraction: A Generic Method of Improving Automatic Term Extraction Algorithms – Johann Petrak
Slides for the talk about the paper:
Ziqi Zhang, Johann Petrak and Diana Maynard, 2018: Adapted TextRank for Term Extraction: A Generic Method of Improving Automatic Term Extraction Algorithms. Semantics-2018, Vienna, Austria
Evaluating Machine Learning Algorithms for Materials Science using the Matbench Protocol – Anubhav Jain
1) The document discusses evaluating machine learning algorithms for materials science using the Matbench protocol.
2) Matbench provides standardized datasets, testing procedures, and an online leaderboard to benchmark and compare machine learning performance.
3) This allows different groups to evaluate algorithms independently and identify best practices for materials science predictions.
This document describes a graph-based approach for skill extraction from text. It discusses expertise retrieval and previous related work. It then describes the Elisit system, which uses Wikipedia pages and spreading activation on Wikipedia's hyperlink network to associate skills with input documents. Sample queries to the system are provided. The method associates documents with Wikipedia pages and then performs spreading activation to find related skills. Evaluation shows biasing the spreading activation improves results.
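A toy sketch of generic spreading activation over a hyperlink graph is shown below; the graph, decay factor, and stopping rule are invented, and Elisit's biasing scheme is not reproduced.

```python
# Generic spreading activation: seed pages matched in a document pass
# attenuated activation to their hyperlink neighbours.
from collections import defaultdict

links = {  # page -> outgoing hyperlinks (toy Wikipedia-like graph)
    "Python (programming language)": ["Machine learning", "Software engineering"],
    "Machine learning": ["Statistics", "Data mining"],
    "Data mining": ["Statistics"],
}

def spread(seeds, decay=0.5, iterations=2):
    activation = defaultdict(float)
    for s in seeds:
        activation[s] = 1.0
    for _ in range(iterations):
        new = defaultdict(float, activation)
        for page, score in activation.items():
            for target in links.get(page, []):
                new[target] += decay * score  # each hop attenuates the signal
        activation = new
    return sorted(activation.items(), key=lambda kv: -kv[1])

# High-activation neighbours of the document's pages become candidate skills.
print(spread(["Python (programming language)"]))
```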
ICML 2018 included papers on generative models, music and audio applications, and AI security. On generative models, papers explored topics like learning many-to-many mappings between domains, joint distribution learning, and reducing amortization gaps in VAEs. In music, works examined hierarchical latent space models for music structure and style transfer for speech synthesis. Regarding security, studies analyzed adversarial attacks across domains, the threat of adversarial examples, and circumventing defenses through obfuscated gradients.
This document summarizes work done by the Julia Lab at MIT on genomics data analysis and optimization of principal component analysis algorithms for genome-wide association studies. It describes how a native Julia implementation of PCA reduced the computation time for finding the top 10 principal components of an 80,000x40,000 genotype matrix from over 2,900 seconds to just 81 seconds. It also discusses how custom matrix-vector multiplication functions achieved the same computation speed while using 32x less memory by reading directly from a compressed data format. Future work directions include more complex analytics, improved data imputation methods, out-of-core matrix operations, and accessing different data formats.
The ontology engineering research community has focused for many years on supporting the creation, development and evolution of ontologies. Ontology forecasting, which aims at predicting semantic changes in an ontology, represents instead a new challenge. In this paper, we aim to contribute to this novel endeavour by focusing on the task of forecasting semantic concepts in the research domain. Indeed, ontologies representing scientific disciplines contain only research topics that are already popular enough to be selected by human experts or automatic algorithms. They are thus unfit to support tasks which require the ability to describe and explore the forefront of research, such as trend detection and horizon scanning. We address this issue by introducing the Semantic Innovation Forecast (SIF) model, which predicts new concepts of an ontology at time t+1, using only data available at time t. Our approach relies on lexical innovation and adoption information extracted from historical data. We evaluated the SIF model on a very large dataset consisting of over one million scientific papers belonging to the Computer Science domain: the outcomes show that the proposed approach offers a competitive boost in mean average precision-at-ten compared to the baselines when forecasting over 5 years.
Question Answering System using machine learning approach – Garima Nanda
In compact form, this presentation shows how a machine learning approach based on classification techniques can be used for effective and efficient question answering.
Graph Centric Analysis of Road Network Patterns for CBD's of Metropolitan Cities – Punit Sharnagat
OSMnx is a Python package to retrieve, model, analyze, and visualize street networks from OpenStreetMap.
OpenStreetMap (OSM) is a collaborative mapping project that provides a free and publicly editable map of the world.
OpenStreetMap provides a valuable crowd-sourced database of raw geospatial data for constructing models of urban street networks for scientific analysis.
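A typical OSMnx call sequence, as a sketch; the place name is arbitrary, and since the OSMnx API has shifted across versions, the exact signatures should be checked against the current documentation.

```python
# Download a drivable street network from OpenStreetMap and summarise it.
import osmnx as ox

G = ox.graph_from_place("Connaught Place, New Delhi, India",
                        network_type="drive")
stats = ox.basic_stats(G)      # ox.stats.basic_stats in newer releases
print(stats["n"], stats["m"])  # node and edge counts of the network
ox.plot_graph(G)               # quick visual check
```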
A Julia package for iterative SVDs with applications to genomics data analysis – Jiahao Chen
This document discusses a Julia package for iterative singular value decompositions (SVDs) with applications to genomics data analysis. It introduces SVD and its use in genome-wide association studies to identify genetic factors associated with diseases and traits. It summarizes different iterative SVD methods like Lanczos iteration and challenges like loss of orthogonality. The document presents a new Julia package called FlashPCA that uses a blocked power method to quickly approximate the top SVD components, and compares its performance to other iterative SVD solvers on large genomic datasets.
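The blocked power method can be sketched in a few lines of NumPy; this is a generic subspace iteration on a toy matrix, not FlashPCA's implementation or the Julia code.

```python
# Approximate the top-k SVD by blocked (subspace) power iteration,
# re-orthonormalising each step to avoid loss of orthogonality.
import numpy as np

def blocked_power_svd(A, k=10, iters=30, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.linalg.qr(rng.standard_normal((A.shape[1], k)))[0]
    for _ in range(iters):
        Q = np.linalg.qr(A.T @ (A @ Q))[0]   # one blocked power step + QR
    B = A @ Q                                # project A onto the subspace
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U, s, Vt @ Q.T

rng = np.random.default_rng(1)
A = rng.standard_normal((500, 60)) @ rng.standard_normal((60, 200))  # toy matrix
_, s, _ = blocked_power_svd(A, k=10)
print(np.round(s[:5], 2))
print(np.round(np.linalg.svd(A, compute_uv=False)[:5], 2))  # reference values
```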
IRJET- Finding the Original Writer of an Anonymous Text using Naïve Bayes Classifier – IRJET Journal
This document discusses using a Naive Bayes classifier to identify the original writer of an anonymous text. It begins with an introduction to the problem and an overview of Naive Bayes classification. It then describes how a text can be represented as a bag-of-words with word frequencies and explains how Naive Bayes can be used to calculate the posterior probability and predict the class. The document compares the performance of Naive Bayes to other algorithms using an email dataset, finding that Naive Bayes has the highest performance factor due to its fast training and prediction times relative to its accuracy. It concludes that Naive Bayes can accurately predict the original writer of anonymous texts.
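For reference, the posterior described above can be written, under the bag-of-words assumption, as:

```latex
% Posterior over candidate writers c for a document d with word counts f_w:
P(c \mid d) \;\propto\; P(c) \prod_{w \in d} P(w \mid c)^{f_w},
\qquad \hat{c} = \arg\max_{c} P(c \mid d)
```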
The Status of ML Algorithms for Structure-property Relationships Using Matbench – Anubhav Jain
The document discusses the development of Matbench, a standardized benchmark for evaluating machine learning algorithms for materials property prediction. Matbench includes 13 standardized datasets covering a variety of materials prediction tasks. It employs a nested cross-validation procedure to evaluate algorithms and ranks submissions on an online leaderboard. This allows for reproducible evaluation and comparison of different algorithms. Matbench has provided insights into which algorithm types work best for certain prediction problems and has helped measure overall progress in the field. Future work aims to expand Matbench with more diverse datasets and evaluation procedures to better represent real-world materials design challenges.
This document provides an overview of the CSE 591: Machine Learning and Applications course taught by Dr. Jieping Ye at Arizona State University. The following key points are discussed:
- Course information including instructor, time/location, prerequisites, and objectives: to provide an understanding of machine learning methods and applications.
- Topics covered include clustering, classification, dimensionality reduction, semi-supervised learning, and kernel learning.
- The grading breakdown includes homework, a group project, and an exam. Students are required to participate in class discussions.
- An introduction to machine learning is provided including definitions of supervised vs. unsupervised learning and applications in domains like bioinformatics.
Extracting and Making Use of Materials Data from Millions of Journal Articles – Anubhav Jain
- The document discusses using natural language processing techniques to extract materials data from millions of journal articles.
- It aims to organize the world's information on materials science by using NLP models to extract useful data from unstructured text sources like research literature in an automated manner.
- The process involves collecting raw text data, developing machine learning models to extract entities and relationships, and building search interfaces to make the extracted data accessible.
Integrating natural language processing and software engineering – Nakul Sharma
This document summarizes research on integrating natural language processing and software engineering. It provides a literature review of works that have used natural language text as input to generate software engineering artifacts like UML diagrams, test cases, and process models. The paper also discusses how techniques from natural language processing can be applied to different phases of the software development life cycle and how natural language understanding can help automate software engineering tasks.
Resource Allocation Using Metaheuristic Search – csandit
This document discusses using metaheuristic search techniques to solve resource allocation and scheduling problems that are common in software development projects. It evaluates the performance of three algorithms - simulated annealing, tabu search, and genetic algorithms - on test problems representative of resource constrained project scheduling problems (RCPSP). The experimental results found that all three metaheuristics can solve such problems effectively, with genetic algorithms performing slightly better overall than the other two techniques.
This document proposes using Word2Vec and decision trees to extract keywords from textual documents and classify the documents. It reviews related work on keyword extraction and text classification techniques. The proposed approach involves preprocessing text, representing words as vectors with Word2Vec, calculating frequently occurring keywords for each category, and using decision trees to classify documents based on keyword similarity. Experiments using different preprocessing and Word2Vec settings achieved an F-score of up to 82% for document classification.
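A rough sketch of such a pipeline follows; averaging word vectors into a document representation is a simplifying assumption standing in for the paper's keyword-similarity step.

```python
# Word2Vec vectors feeding a decision tree classifier (toy corpus).
import numpy as np
from gensim.models import Word2Vec
from sklearn.tree import DecisionTreeClassifier

docs = [("machine learning model training data", "ML"),
        ("football match goal score league", "Sport"),
        ("neural network training accuracy", "ML"),
        ("tennis player tournament final", "Sport")]

sentences = [text.split() for text, _ in docs]
w2v = Word2Vec(sentences, vector_size=32, min_count=1, epochs=50)

def doc_vector(tokens):
    # Average the word vectors of known tokens as a document feature.
    return np.mean([w2v.wv[t] for t in tokens if t in w2v.wv], axis=0)

X = np.array([doc_vector(s) for s in sentences])
y = [label for _, label in docs]
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([doc_vector("league final score".split())]))
```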
Automatic subject prediction is a desirable feature for modern digital library systems, as manual indexing can no longer cope with the rapid growth of digital collections. This is an "extreme multi-label classification" problem, where the objective is to assign a small subset of the most relevant subjects from an extremely large label set. Data sparsity and model scalability are the major challenges we need to address to solve it automatically. In this paper, we describe an efficient and effective embedding method that embeds terms, subjects and documents into the same semantic space, where similarity can be computed easily. We then propose a novel Non-Parametric Subject Prediction (NPSP) method and show how effectively it predicts even very specialised subjects, which are associated with few documents in the training set and are not predicted by state-of-the-art classifiers.
The document proposes a framework for mining product reputations from online opinions. It extracts opinions from web pages and labels them with positive or negative opinion likelihood. Reputation is analyzed using rule analysis to extract characteristic words, co-occurrence analysis, typical sentence analysis, and correspondence analysis to map relationships. Experiments analyzing opinions of cell phones, PDAs, and ISPs showed the framework in action. The framework combines opinion extraction with text mining to automatically gather and analyze large volumes of online opinions.
Hannah Peeler is a senior at the University of Texas at Austin majoring in electrical engineering with an overall GPA of 3.87. She has experience with embedded systems and circuitry design from academic projects. Peeler has held leadership roles with IEEE and the Women in Engineering Program where she has organized events and mentored other students. She also participated in a leadership development program and is involved with other engineering organizations on campus. Peeler will graduate in May 2018 with honors and is eligible to work in the US without restrictions.
This summarizes an academic paper that proposes an automatic ontology creation method for classifying research papers. It uses text mining techniques like classification and clustering algorithms. It first builds a research ontology by extracting keywords and patterns from previous papers. It then uses a decision tree algorithm to classify new papers into disciplines defined in the ontology. The classified papers are then clustered based on similarities to group them. The method was tested on a dataset of 100 papers and achieved average precision of 85.7% for term-based and 89.3% for pattern-based keyword extraction.
Exploiting Wikipedia and Twitter for Text Mining Applications – IRJET Journal
This document discusses exploiting Wikipedia and Twitter for text mining applications. It explores using Wikipedia's category-article structure for text classification, subjectivity analysis, and keyword extraction. It evaluates classifying tweets as relevant/irrelevant to entities or brands and classifying tweets into topical dimensions like workplace or innovation. Features used include relatedness scores between tweet text and Wikipedia categories, topic modeling scores, and Twitter-specific features. Experimental results show the Wikipedia framework based on its category-article structure outperforms standard text mining techniques.
Automatic Grading of Handwritten Answers – IRJET Journal
The document describes a proposed system for automatically grading handwritten exam answers. It uses optical character recognition to extract text from scanned answer sheets. Natural language processing techniques like BERT embeddings and Word Mover's Distance are used to compare the student's answer against reference answers from teachers. The system aims to grade papers quickly and accurately to reduce workload for teachers while still providing detailed performance assessments for students. It was developed to address the need for online and at-scale exam grading given limitations of traditional in-person exams.
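The Word Mover's Distance comparison step can be sketched with gensim and pretrained vectors; the OCR and BERT components are omitted, and the example answers are invented.

```python
# Compare a student answer against a reference answer via Word Mover's
# Distance (needs gensim plus its optimal-transport dependency, e.g. POT).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # small pretrained vectors

reference = "photosynthesis converts light energy into chemical energy".split()
student   = "plants turn sunlight into chemical energy".split()
off_topic = "the mitochondria is the powerhouse of the cell".split()

# Lower distance = closer to the reference answer, hence a higher grade.
print(vectors.wmdistance(reference, student))
print(vectors.wmdistance(reference, off_topic))
```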
Text Segmentation for Online Subjective Examination using Machine Learning – IRJET Journal
This document discusses using k-Nearest Neighbor (K-NN) machine learning for text segmentation of online exams. K-NN is an instance-based learning method that computes similarity between feature vectors to determine how similar two texts are. The goal is to implement natural language processing using text segmentation, which benefits automated evaluation. It reviews related work applying machine learning methods such as K-NN, support vector machines, and decision trees to tasks like text categorization and clustering.
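A minimal K-NN text-similarity sketch with scikit-learn; TF-IDF features and cosine distance are assumptions, since the summary does not specify the feature vectors used.

```python
# Instance-based classification: new text is labelled like its nearest
# neighbour in TF-IDF space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

answers = ["the process splits text into coherent segments",
           "completely unrelated response about the weather",
           "segmentation divides the answer into topical units"]
labels = ["relevant", "irrelevant", "relevant"]

knn = make_pipeline(TfidfVectorizer(),
                    KNeighborsClassifier(n_neighbors=1, metric="cosine"))
knn.fit(answers, labels)
print(knn.predict(["text is divided into segments by topic"]))
```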
The document describes a project to semantically annotate research papers with ACM classification categories. It discusses using cosine similarity, latent Dirichlet allocation, and a proposed model combining labeled LDA and doc2vec. The proposed model trains a supervised topic model to learn document representations that capture semantic relationships between papers and categories. The model achieved 59.31% mean average precision and 45.03% NDCG on a test dataset, demonstrating an improvement over baselines.
Knowledge Graph and Similarity Based Retrieval Method for Query Answering System – IRJET Journal
This document proposes a knowledge graph and question answering system to extract and analyze information from large volumes of unstructured data like annual reports. It discusses using natural language processing techniques like named entity recognition with spaCy and dependency parsing to extract entity-relation pairs from text and construct a knowledge graph. For question answering, it analyzes user queries with similar NLP approaches and then matches query triplets to the knowledge graph to retrieve answers, combining information retrieval and trained classifiers. The proposed system aims to provide faster understanding and analysis of complex, unstructured data for professionals.
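A simplified sketch of the extraction step with spaCy; this subject-verb-object heuristic is an assumption for illustration, not the paper's exact method.

```python
# Named entities plus naive (subject, verb-lemma, object) triples for a
# knowledge graph (requires the en_core_web_sm model to be installed).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Acme Corp acquired Beta Ltd in 2021. Revenue grew strongly.")

print([(ent.text, ent.label_) for ent in doc.ents])  # entity candidates

for token in doc:
    if token.pos_ == "VERB":
        subj = [w for w in token.lefts if w.dep_ in ("nsubj", "nsubjpass")]
        obj = [w for w in token.rights if w.dep_ in ("dobj", "obj")]
        if subj and obj:
            # Head tokens only; a fuller system would expand to entity spans.
            print((subj[0].text, token.lemma_, obj[0].text))
```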
Ukrainian Catholic University, Faculty of Applied Sciences, Data Science Master Program, January 22nd
Abstract. Generative adversarial networks (GANs) are one of the most popular models capable of producing high-quality images. However, most works generate images from a vector of random values, without explicit control of desired output properties. We study ways of introducing such control for a user-selected region of interest (RoI). First, we overview and analyze the existing work in the areas of image completion (inpainting) and controllable generation. Second, we propose our model based on GANs, which unites approaches from the two mentioned areas, for controllable local content generation. Third, we evaluate the controllability of our model on three accessible datasets – CelebA, Cats, and Cars – and give numerical and visual results of our method.
Migration strategies for object oriented system to component based system – ijfcstjournal
Migration of an object oriented system to a component based system is not an easy task: not only do a lot of technical changes need to be made, but numerous other issues also need to be kept in mind. However, component based software development has been gaining popularity over the past few years and has higher reusability scope. Programs built using the CBSE approach are confirmed to be suitable for new environments. These days it is universal practice to reuse components in projects to achieve better quality and to save time, so moving from object oriented to CBSE seems a wise decision. A number of approaches have been introduced to implement this, and each of them has its own pros and cons. The paper gives a brief review of work by different authors in this area from the year 2000 to 2014.
A Recommender Story: Improving Backend Data Quality While Reducing Costs – Databricks
Information overload is one of the biggest challenges academics face on a daily basis while finding the right knowledge to advance science. With around 7k research articles being published every day, how do you find the right ones?
Elsevier is a global information analytics business that helps institutions and professionals advance healthcare, open science and improve performance. With many data sources and signals being available, data science and big data engineering provide the perfect opportunity to deliver more value to researchers.
Here we will focus on Mendeley, an open (free of charge) academic content platform that helps researchers discover new information via functionalities such as a crowd-sourced collection of academic documents (the Catalogue) and various personalized recommender systems. Mendeley Suggest, the recommender system, helps millions of researchers worldwide find documents and people relevant to their research field that they did not yet know existed. The personalised recommenders are powered by the Mendeley Catalogue, which clusters 2 billion records correctly into canonical records, state-of-the-art algorithms, and big data solutions (e.g. Spark).
In the past few years, we noticed that with our content growth, the quality of the canonical records started drifting due to scalability issues. As a result, we faced clustering accuracy problems which, in turn, also impacted the recommenders. In this talk we will highlight how we rearchitected the fabrication of the Mendeley Catalogue to improve its scalability and accuracy. In addition, we will show how the migration from Hadoop MapReduce to Spark has helped us reduce costs as well as improve maintainability.
This document discusses pointer analysis techniques for object-oriented languages. It provides background on pointer analysis and its importance for program analysis and optimization. It also reviews related work on pointer analysis algorithms for both procedural and object-oriented languages. The document motivates the need for precise pointer analysis to enable optimizations like instruction scheduling and register allocation. It then provides an introduction to Binary Decision Diagrams which are used to represent pointer analysis results.
How to Write an Effective Technical Paper (1).pdf – khalid khan
This document summarizes a webinar on how to write an effective technical paper. The webinar was presented by Saifur Rahman, PhD, who is a professor at Virginia Tech and former president of the IEEE Power & Energy Society. The webinar covered topics like choosing an audience, following proper structure and format, and addressing ethical standards. It provided guidance on elements like the title, abstract, introduction, methodology, results, and references. The webinar emphasized writing clearly for editors and reviewers and suggested steps for collaboratively writing a paper.
On Using Network Science in Mining Developers Collaboration in Software Engineering – IJDKP
Background: Network science is the set of mathematical frameworks, models, and measures that are used to understand a complex system modeled as a network composed of nodes and edges. The nodes of a network represent entities and the edges represent relationships between these entities. Network science has been used in many research works for mining human interaction during different phases of software engineering (SE). Objective: The goal of this study is to identify, review, and analyze the published research works that used network analysis as a tool for understanding human collaboration at different levels of software development. This study and its findings are expected to benefit software engineering practitioners and researchers who are mining software repositories using tools from the network science field. Method: We conducted a systematic literature review, in which we analyzed a number of selected papers from different digital libraries based on inclusion and exclusion criteria. Results: We identified 35 primary studies (PSs) from four digital libraries, then extracted data from each PS according to a predefined data extraction sheet. The results of our data analysis showed that not all of the constructed networks used in the PSs were valid, as the edges of these networks did not reflect a real relationship between the entities of the network. Additionally, the measures used in the PSs were in many cases not suitable for the networks in question. Also, the analysis results reported by the PSs were in most cases not validated using any statistical model. Finally, many of the PSs did not provide lessons or guidelines for software practitioners that could improve software engineering practice. Conclusion: Although employing network analysis in mining developers' collaboration showed satisfactory results in some of the PSs, the application of network analysis needs to be conducted more carefully. That said, the constructed network should be representative and meaningful, the measures used need to suit the context, and validation of the results should be considered. Over and above that, we state some research gaps in which network science can be applied, with some pointers to recent advances that can be used to mine collaboration networks.
This document discusses defect and architectural metrics for assessing the quality of C language software. It presents research goals to determine the distribution of dependency and social network metrics in projects, their relationship to defects, and how metrics and defects evolve over time. Methodologies for generating dependency graphs and calculating metrics are described. Results show correlations between fixing commits and certain metrics, and that metrics change at points of refactoring or major bugs. The research aims to help localize architectural problems through metrics-based analysis.
This document summarizes Andre Freitas' talk on AI beyond deep learning. It discusses representing meaning from text at scale using knowledge graphs and embeddings. It also covers using neuro-symbolic models like graph networks on top of knowledge graphs to enable few-shot learning, explainability, and transportability. The document advocates that AI engineers should focus on representation design and evaluating multi-component NLP systems.
Text Summarization and Conversion of Speech to Text – IRJET Journal
This document discusses text summarization and speech to text conversion using deep learning algorithms. It describes how recurrent neural networks can be used for text summarization by identifying key information and semantic meaning from text. Speech recognition uses similar deep learning methods to convert spoken audio to text. The document also provides an overview of the text summarization process, including segmentation, normalization, feature extraction, and modeling steps. It concludes that these models can generate summarized text from extensive documents and meetings.
Development of Computer Aided Learning Software for Use in Electric Circuit Analysis – drboon
Presently, instructors are required to teach more students with the same resources, thereby reducing the amount of time instructors have with their students. Because of this, examples may be omitted to be able to make it through all of the required material. This can be problematic with electric circuit analysis courses and other courses used as prerequisites. A lack of understanding in these classes will likely continue in future classes. While software is often used in these classes, often it is analysis software not meant to teach concepts. Teaching software does exist, but may have only a preset number of problems or only provide the solution. Others provide a ‘limitless’ number of problems by changing component values, but each ends up being the same basic problem. This paper introduces new learning software that addresses these shortcomings. The software provides a practically limitless number of problems by varying component values and circuit structure. Moreover, it provides both an answer and an explanation. Finally, it is designed so that students who need more help can get it, while those who do not can move on.
IRJET- Natural Language Query Processing – IRJET Journal
The document discusses the development of a natural language query processing system that allows users to retrieve data from a database using simple English statements rather than SQL queries. It proposes a system that takes an English query as input, analyzes it to extract keywords, uses those keywords to generate an equivalent SQL query, executes the SQL query on the database, and returns the results to the user. The system is meant to make accessing database information easier for non-technical users by allowing them to use natural language instead of SQL.
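A toy sketch of the keyword-to-SQL idea; the table, columns, and mapping rules are invented for illustration and are far simpler than a real system.

```python
# Map recognised keywords in an English query to a SQL statement.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE students(name TEXT, grade INTEGER)")
con.executemany("INSERT INTO students VALUES (?, ?)",
                [("Asha", 91), ("Ravi", 78)])

def english_to_sql(query):
    # Very naive keyword spotting; real systems parse the query properly.
    tokens = query.lower().split()
    table = "students" if "students" in tokens else None
    column = "grade" if {"grade", "grades"} & set(tokens) else "*"
    return f"SELECT name, {column} FROM {table}" if table else None

sql = english_to_sql("show the grades of all students")
print(sql, "->", con.execute(sql).fetchall())
```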
Deduplication and Author-Disambiguation of Streaming Records via Supervised M... – Spark Summit
Here we present a general supervised framework for record deduplication and author-disambiguation via Spark. This work differentiates itself by – Application of Databricks and AWS makes this a scalable implementation. Compute resources are comparably lower than traditional legacy technology using big boxes 24/7. Scalability is crucial as Elsevier’s Scopus data, the biggest scientific abstract repository, covers roughly 250 million authorships from 70 million abstracts covering a few hundred years. – We create a fingerprint for each content item by deep learning and/or word2vec algorithms to expedite pairwise similarity calculation. These encoders substantially reduce compute time while maintaining semantic similarity (unlike traditional TFIDF or predefined taxonomies). We will briefly discuss how to optimize word2vec training with high parallelization. Moreover, we show how these encoders can be used to derive a standard representation for all our entities, such as documents, authors, users, journals, etc. This standard representation can simplify the recommendation problem into a pairwise similarity search and hence offer a basic recommender for cross-product applications where we may not have a dedicated recommender engine designed. – Traditional author-disambiguation or record deduplication algorithms are batch-processing with small to no training data. However, we have roughly 25 million authorships that are manually curated or corrected upon user feedback. Hence, it is crucial to maintain historical profiles, and so we have developed a machine learning implementation to deal with data streams and process them in mini batches or one document at a time. We will discuss how to measure the accuracy of such a system, how to tune it, and how to process the raw pairwise similarity scores into final clusters. Lessons learned from this talk can help all sorts of companies that want to integrate their data or deduplicate their user/customer/product databases.
Claremont Report on Database Research: Research Directions (Le Gruenwald) – infoblog
This is a set of slides from the Claremont Report on Database Research, see http://db.cs.berkeley.edu/claremont/ for more details. These particular slides are from a "Research Directions" talk by "Le Gruenwald." (Uploaded for discussion at the Stanford InfoBlog, http://infoblog.stanford.edu/.)
The document summarizes a research paper that proposed a link prediction model for citation networks. It applied support vector machines (SVMs) as the classifier and used 11 features optimized for citation networks across 5 academic fields. With these features, the model predicted links more accurately than the classifier alone. However, the effective features varied by academic field, suggesting that different models should be applied to different research areas.
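An illustrative sketch of SVM-based link prediction with scikit-learn; the three pairwise features and their values are invented, and the paper's 11 features are not reproduced.

```python
# SVM over hand-built features of candidate citation edges.
import numpy as np
from sklearn.svm import SVC

# Columns (invented): common neighbours, topical similarity, year gap.
X = np.array([[5, 0.9, 1], [0, 0.1, 9], [4, 0.8, 2], [1, 0.2, 7]])
y = np.array([1, 0, 1, 0])  # 1 = a citation link later appeared

clf = SVC(kernel="rbf").fit(X, y)
print(clf.decision_function([[3, 0.7, 3]]))  # >0 leans towards a future link
print(clf.predict([[3, 0.7, 3]]))
```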
Similar to Determining the Credibility of Science Communication
Beyond Fact Checking — Modelling Information Change in Scientific Communication – Isabelle Augenstein
The document discusses modelling information change in scientific communication. It begins by noting how science is often communicated through journalists to the public, and how the message can change and become exaggerated or misleading along the way. It then discusses developing models to detect exaggeration by predicting the strength of causal claims, such as distinguishing between correlational and causal language. Pattern exploiting training is explored as a way to leverage large language models for this task in a semi-supervised manner. Finally, it proposes generally modelling information change by comparing original research to how it is communicated elsewhere, such as in news articles and tweets, using semantic matching techniques. Experiments are discussed on newly created datasets to benchmark performance of models on this task.
The document discusses automatically detecting scientific misinformation and exaggeration. It introduces work on cite-worthiness detection to improve scientific document understanding, and on detecting exaggeration in health science press releases. It describes generating scientific claims from citations for zero-shot scientific fact checking. The talk covers claim detection and generation, cite-worthiness detection, scientific claim generation, and exaggeration detection.
The past decade has seen a substantial rise in the amount of mis- and disinformation online, from targeted disinformation campaigns to influence politics, to the unintentional spreading of misinformation about public health. This development has spurred research in the area of automatic fact checking, a knowledge-intensive and complex reasoning task. Most existing fact checking models predict a claim’s veracity with black-box models, which often lack explanations of the reasons behind their predictions and contain hidden vulnerabilities. The lack of transparency in fact checking systems and ML models in general has been exacerbated by increased model size and by “the right…to obtain an explanation of the decision reached” enshrined in European law. This talk presents some first solutions to generating explanations for fact checking models. It then examines how to assess the generated explanations using diagnostic properties, and how further optimising for these diagnostic properties can improve the quality of the generated explanations. Finally, the talk examines how to systematically reveal vulnerabilities of black-box fact checking models.
Towards Explainable Fact Checking (DIKU Business Club presentation)Isabelle Augenstein
Outline:
- Fact checking – what is it and why do we need it?
- False information online
- Content-based automatic fact checking
- Explainability – what is it and why do we need it?
- Making the right predictions for the right reasons
- Model training pipeline
- Explainable fact checking – some first solutions
- Rationale selection
- Generating free-text explanations
- Wrap-up
Tutorial on 'Explainability for NLP' given at the first ALPS (Advanced Language Processing) winter school: http://lig-alps.imag.fr/index.php/schedule/
The talk introduces the concepts of 'model understanding' as well as 'decision understanding' and provides examples of approaches from the areas of fact checking and text classification.
Exercises to go with the tutorial are available here: https://github.com/copenlu/ALPS_2021
Automatic fact checking is one of the more involved NLP tasks currently researched: not only does it require sentence understanding, but also an understanding of how claims relate to evidence documents and world knowledge. Moreover, there is still no common understanding in the automatic fact checking community of how the subtasks of fact checking — claim check-worthiness detection, evidence retrieval, veracity prediction — should be framed. This is partly owing to the complexity of the task, despite efforts to formalise the task of fact checking through the development of benchmark datasets.
The first part of the talk will be on automatically generating textual explanations for fact checking, thereby exposing some of the reasoning processes these models follow. The second part of the talk will be on re-examining how claim check-worthiness is defined, and how check-worthy claims can be detected; followed by how to automatically generate claims which are hard to fact-check automatically.
Talk on 'Tracking False Information Online' at W-NUT workshop at EMNLP 2019.
=========
Digital media enables fast sharing of information and discussions among users. While this comes with many benefits to today’s society, such as broadening information access, the manner in which information is disseminated also has obvious downsides. Since fast access to information is expected by many users and news outlets are often under financial pressure, speedy access often comes at the expense of accuracy, which leads to misinformation. Moreover, digital media can be misused by campaigns to intentionally spread false information, i.e. disinformation, about events, individuals or governments. In this talk, I will present different ways false information is spread online, including misinformation and disinformation. I will then report findings from our recent and ongoing work on automatic fact checking, stance detection and framing attitudes.
What can typological knowledge bases and language representations tell us abo...Isabelle Augenstein
One of the core challenges in typology is to record properties of languages in a structured way. As a result of manual efforts, typological knowledge bases have emerged, which contain information about languages’ phonological, morphological and syntactic properties, as well as about language families. Ideally, such typological knowledge bases would provide useful information for multilingual NLP models to learn how to selectively share parameters.
A related area of research suggests a different way of encoding properties of languages, namely to learn language representation vectors directly from text documents.
In this talk, I will analyse and contrast these two ways of encoding linguistic properties, as well as present research on how the two can benefit one another.
Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate ...Isabelle Augenstein
Paper presented at NAACL 2018. Link: https://arxiv.org/abs/1802.09913
Abstract:
============
We combine multi-task learning and semi-supervised learning by inducing a joint embedding space between disparate label spaces and learning transfer functions between label embeddings, enabling us to jointly leverage unlabelled data and auxiliary, annotated datasets. We evaluate our approach on a variety of sequence classification tasks with disparate label spaces. We outperform strong single and multi-task baselines and achieve a new state-of-the-art for topic-based sentiment analysis.
Learning with limited labelled data in NLP: multi-task learning and beyondIsabelle Augenstein
When labelled training data for certain NLP tasks or languages is not readily available, different approaches exist to leverage other resources for the training of machine learning models. Those are commonly either instances from a related task or unlabelled data.
An approach that has been found to work particularly well when only limited training data is available is multi-task learning.
There, a model learns from examples of multiple related tasks at the same time by sharing hidden layers between tasks, and can therefore benefit from a larger overall number of training instances and improve its generalisation performance. In the related paradigm of semi-supervised learning, unlabelled data as well as labelled data for related tasks can easily be utilised by transferring labels from labelled instances to unlabelled ones, essentially extending the training dataset.
In this talk, I will present my recent and ongoing work in the space of learning with limited labelled data in NLP, including our NAACL 2018 papers 'Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate Label Spaces' [1] and 'From Phonology to Syntax: Unsupervised Linguistic Typology at Different Levels with Language Embeddings' [2].
[1] https://t.co/A5jHhFWrdw
[2] https://arxiv.org/abs/1802.09375
==========
Bio from my website http://isabelleaugenstein.github.io/index.html:
I have been a tenure-track assistant professor in the Department of Computer Science at the University of Copenhagen since July 2017. I am affiliated with the CoAStAL NLP group and work in the general areas of Statistical Natural Language Processing and Machine Learning. My main research interests are weakly supervised and low-resource learning with applications including information extraction, machine reading and fact checking.
Before starting a faculty position, I was a postdoctoral research associate in Sebastian Riedel's UCL Machine Reading group, mainly investigating machine reading from scientific articles. Prior to that, I was a Research Associate in the Sheffield NLP group, a PhD Student in the University of Sheffield Computer Science department, a Research Assistant at AIFB, Karlsruhe Institute of Technology and a Computational Linguistics undergraduate student at the Department of Computational Linguistics, Heidelberg University.
The spread of mis- and disinformation is growing and is having a big impact on interpersonal communication, politics and even science.
Traditional methods, e.g. manual fact-checking by reporters, cannot keep up with the growth of information. On the other hand, there has been much progress in natural language processing recently, partly due to the resurgence of neural methods.
How can natural language processing methods fill this gap and help to automatically check facts?
This talk will explore different ways to frame fact checking and detail our ongoing work on learning to encode documents for automated fact checking, as well as describe future challenges.
1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 - ...Isabelle Augenstein
The document summarizes the history and goals of the WiNLP workshop, which aims to promote and support women and underrepresented groups in natural language processing. It discusses the growth of WiNLP from its inception in 2016 to the 2017 workshop with over 130 participants. It outlines WiNLP's mission to increase awareness of work by underrepresented groups and build community. It also notes challenges such as underrepresentation, bias, and lack of resources that WiNLP addresses through mentoring, funding, and community building.
Presentation of work that will be published at EMNLP 2016.
Ben Eisner, Tim Rocktäschel, Isabelle Augenstein, Matko Bošnjak, Sebastian Riedel. emoji2vec: Learning Emoji Representations from their Description. SocialNLP at EMNLP 2016. https://arxiv.org/abs/1609.08359
Georgios Spithourakis, Isabelle Augenstein, Sebastian Riedel. Numerically Grounded Language Models for Semantic Error Correction. EMNLP 2016. https://arxiv.org/abs/1608.04147
Isabelle Augenstein, Tim Rocktäschel, Andreas Vlachos, Kalina Bontcheva. Stance Detection with Bidirectional Conditional Encoding. EMNLP 2016. https://arxiv.org/abs/1606.05464
USFD at SemEval-2016 - Stance Detection on Twitter with AutoencodersIsabelle Augenstein
This paper describes the University of Sheffield's submission to the SemEval 2016 Twitter Stance Detection weakly supervised task (SemEval 2016 Task 6, Subtask B). In stance detection, the goal is to classify the stance of a tweet towards a target as "favor", "against", or "none". In Subtask B, the targets in the test data are different from the targets in the training data, thus rendering the task more challenging but also more realistic.
To address the lack of target-specific training data, we use a large set of unlabelled tweets containing all targets and train a bag-of-words autoencoder to learn how to produce feature representations of tweets. These feature representations are then used to train a logistic regression classifier on labelled tweets, with additional features such as an indicator of whether the target is contained in the tweet. Our submitted run on the test data achieved an F1 of 0.3270.
Paper: http://isabelleaugenstein.github.io/papers/SemEval2016-Stance.pdf
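A hedged sketch of the pipeline described above (illustrative data and hyperparameters, not the exact USFD system): a small bag-of-words autoencoder is trained on unlabelled tweets, and its hidden representation, concatenated with a target-in-tweet indicator, feeds a logistic regression stance classifier:

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

unlabelled = ["climate change is real", "i support the new policy", "no opinion on this"]
labelled = [
    ("climate change is a hoax", "climate change", "against"),
    ("i fully support the new policy", "the new policy", "favor"),
]

vec = CountVectorizer(binary=True)
X_unlab = torch.tensor(vec.fit_transform(unlabelled).toarray(), dtype=torch.float32)

# Tiny bag-of-words autoencoder: reconstruct the input through a bottleneck.
d = X_unlab.shape[1]
encoder, decoder = nn.Linear(d, 16), nn.Linear(16, d)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(decoder(torch.relu(encoder(X_unlab))), X_unlab)
    loss.backward()
    opt.step()

def features(tweet, target):
    """Autoencoder representation plus a target-in-tweet indicator feature."""
    x = torch.tensor(vec.transform([tweet]).toarray(), dtype=torch.float32)
    h = torch.relu(encoder(x)).detach().numpy()[0]
    return np.append(h, float(target in tweet))

X = np.array([features(t, tgt) for t, tgt, _ in labelled])
y = [lab for _, _, lab in labelled]
clf = LogisticRegression().fit(X, y)  # stance: favor / against / none
```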
Imitation learning is used to address the problem of distant supervision for relation extraction. It decomposes the task into named entity classification (NEC) and relation extraction (RE), allowing the models to be trained separately. Through an iterative process, imitation learning is able to learn the dependencies between NEC and RE even when only labels for RE are provided. This overcomes limitations of prior approaches that rely on distantly labeled data. Evaluation shows the approach improves over baselines by leveraging multi-stage modeling to compensate for mistakes at the NEC stage.
Extracting Relations between Non-Standard Entities using Distant Supervision ...Isabelle Augenstein
Poster for our EMNLP paper on extracting non-standard relations from the Web with distant supervision and imitation learning. Read the full paper here: https://aclweb.org/anthology/D/D15/D15-1086.pdf
Slides for my tutorial at the ESWC Summer School 2015, giving an introduction to information extraction with Linked Data and an introduction to one of the applications of information extraction, opinion mining.
Relation Extraction from the Web using Distant SupervisionIsabelle Augenstein
This paper proposes using distant supervision to extract relations from web text to populate knowledge bases without requiring manual effort. It does this by using an existing knowledge base to automatically label sentences with entity relations, training a classifier on this distant supervision data. The paper describes using statistical methods to select better training data and discard noisy examples, and shows this improves precision. It also introduces methods for integrating information across sentences which improves both precision and recall of extracted relations.
3. Supporting the Life Cycle of Research
[Diagram: the research life cycle (Information Discovery, Conducting Experiments, Paper Writing, Peer Review, Research Impact Tracking) with supporting NLP tasks: Information Extraction, Summarisation, Citation Prediction, Writing Assistance, Reviewing Support, Reviewer Matching, Review Score Prediction, Citation Analysis, Citation Trend Analysis]
4. Scholarly Document Processing
• Goal: to automatically process scientific text to support scholars
• Example NLP tasks
• Extract information about scientific concepts, e.g. drugs and proteins
• Recommend relevant papers to cite
• Challenges
• Supervised learning is hard: annotation is expensive, requiring domain experts
• Language used is diverse across fields
• Different modalities
• Meta-data also important
7. Credibility and Veracity of Science Communication
• Shortcomings of prior work
• Assumes scientific writing is credible
• Assumes claims made are supported by underlying evidence
• Example issues
• When writing a paper
• Making claims not backed up by literature
• Missing important citations
• Presenting conclusions not supported by data
• Popular science communication
• Distortion of findings
• Exaggerations
• Outright misrepresentations
8. Challenges Addressed In This Talk
• Cite-worthiness detection
• Detecting if a sentence should include a citation to prior work
• Useful for assistive writing of scientific papers
• Similar to claim detection in fact checking
• Exaggeration detection
• Detecting if a news article exaggerates claims made in a scientific paper
• Useful for assistive writing & quality check of press releases
• Related to veracity prediction, but a more nuanced task
9. Overview of Today’s Talk
• Introduction
• The Life Cycle of Scientific Research
• Part 1: Cite-Worthiness Detection
• The CiteWorth dataset
• Methods for cite-worthiness detection
• Part 2: Exaggeration Detection
• Task framing
• Semi-supervised learning for exaggeration detection
• Conclusion
• Future research challenges
11. Scholarly Document Processing
• Challenges
• Supervised learning is hard: annotation is expensive, requiring domain experts
• The text is diverse across fields
• How can we improve tools for scholarly document processing across fields?
• What training data is readily available?
13. Citances in Machine Learning
We use the model from the original BERT paper (Devlin et al. 2019).
Cite-worthiness: Is this a citance? Yes
Recommendation: What paper should be cited? Devlin et al. (2019)
Influence: Was this an influential paper? Yes
Intent: What is the purpose of the citation? Method
14. Cite-Worthiness Uses
We use the model from the original BERT paper (Devlin et al. 2019).
As an auxiliary task in a multi-task setup
We use the model from the original BERT paper (Devlin et al. 2019).
CITE-WORTHY METHOD
15. Cite-Worthiness Uses
We use the model from the original BERT paper (Devlin et al. 2019).
As a first step in citation recommendation
We use the model from the original BERT paper (Devlin et al. 2019).
CITE-WORTHY
16. Cite-Worthiness Uses
We use the model from the original BERT paper (Devlin et al. 2019).
For assistive document editing
We use the model from the original BERT paper (Devlin et al. 2019).
CITE-WORTHY
17. Cite-Worthiness Datasets
• Tend to be small and limited to only a few domains (e.g. Computer Science)
• No attention paid to how clean the data is (e.g. ungrammatical phrases)
Example: "We use the model from Devlin et al. (2019) as a baseline."
18. CiteWorth: Dataset Curation
RQ1: How can a dataset for cite-worthiness detection be automatically curated with low noise?
• Source data: S2ORC1 – millions of extracted scientific documents from Semantic Scholar
• We limit citances as follows
• Parenthetical author/year and bracketed numerical citations only
• Citations must be at the end of a sentence
Examples: "We use the model from the original BERT paper (Devlin et al. 2019)." / "We use the model from the original BERT paper [1]."
1. https://github.com/allenai/s2orc
19. CiteWorth: Cleaning the Data
RQ1: How can a dataset for cite-worthiness detection be automatically curated with low noise?
Example: "We use the model from the original BERT paper (Devlin et al. 2019). This model uses self-attention and masked language modeling."
1. Extract whole paragraphs – data is curated at the paragraph level
2. Check whether all gold citation spans are parenthetical author/year or bracketed numerical
3. Check if all citation spans have been extracted for each sentence
4. Check if all citation spans come at the end of a sentence
5. Remove citation spans using gold spans
6. Check if any citation markers are left over (e.g. hanging prepositions/punctuation)
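The filtering in steps 4-6 can be approximated with simple regular expressions. Below is a hedged sketch, not the actual CiteWorth code; the patterns and the leftover-marker check are assumptions:

```python
import re

# Matches parenthetical author/year citations like "(Devlin et al. 2019)" and
# bracketed numerical citations like "[1]" or "[1, 2]".
CITATION = re.compile(r"\s*(\(\s*[A-Z][^()]*\d{4}[^()]*\)|\[\d+(,\s*\d+)*\])")

def clean_sentence(sentence):
    """Return the citation-free sentence, or None if it fails the checks."""
    match = CITATION.search(sentence)
    if match is None:
        return sentence  # no citation: a candidate non-cite-worthy sentence
    # Step 4: the citation span must come at the end of the sentence.
    if not sentence.rstrip(" .").endswith(match.group().strip(" .")):
        return None
    # Step 5: remove the citation span(s).
    cleaned = CITATION.sub("", sentence).strip()
    # Step 6: discard sentences with leftover markers, e.g. hanging prepositions.
    if re.search(r"\b(in|by|of|from|see)\s*[.?!]?$", cleaned):
        return None
    return cleaned

print(clean_sentence("We use the model from the original BERT paper (Devlin et al. 2019)."))
# -> "We use the model from the original BERT paper."
```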
20. CiteWorth Final Dataset
RQ1: How can a dataset for cite-worthiness detection be automatically curated with low noise?
• 1,181,793 sentences
• 10 different fields, 20,000+ paragraphs per field
• Much cleaner than a naive baseline which only removes citation text based on gold spans

Method            Sentences Clean (%)   Citation Markers Removed (%)
Naive Baseline    92.07                 92.78
CiteWorth (Ours)  98.90                 98.10
21. Predicting on Individual Sentences
RQ2: What methods are most effective for automatically detecting cite-worthy sentences?
[Diagram: candidate models: logistic regression, a convolutional recurrent network1, and pretrained transformer language models (embedding, multi-head attention, feed-forward, add & norm layers)2]
1. Michael Färber, Alexander Thiemann, and Adam Jatowt. 2018b. To Cite, or Not to Cite? Detecting Citation Contexts in Text. In European Conference on Information Retrieval, pages 598–603. Springer.
2. https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270
22. Predicting on Individual Sentences
RQ2: What methods are most effective for automatically detecting cite-worthy sentences?
Can context improve performance?

Method               P      R      F1
Logistic Regression  46.65  64.88  54.28
CRNN                 50.87  62.21  55.97
Transformer          47.92  71.59  57.39
BERT                 55.04  69.02  61.23
SciBERT              57.03  68.08  62.06

* Pieter-Jan Kindermans, Kristof Schütt, Klaus-Robert Müller, and Sven Dähne. 2016. Investigating the Influence of Noise and Distractors on the Interpretation of Neural Networks. arXiv preprint arXiv:1611.07270.
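To make the comparison concrete, here is a minimal sketch of the strongest single-sentence setup above (SciBERT with a binary classification head), using the public allenai/scibert_scivocab_uncased checkpoint; the toy sentences and labels stand in for CiteWorth examples:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=2)  # 0 = not cite-worthy, 1 = cite-worthy

sentences = ["We use the model from the original BERT paper.",
             "The remainder of this section is organised as follows."]
labels = torch.tensor([1, 0])  # toy labels standing in for CiteWorth

batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
out = model(**batch, labels=labels)  # cross-entropy loss + logits
out.loss.backward()                  # one fine-tuning step; optimiser omitted
preds = out.logits.argmax(dim=-1)    # per-sentence cite-worthiness prediction
```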
23. Predicting Multiple Sentences at Once
RQ2: What methods are most effective for automatically detecting cite-worthy sentences?
Are there variations across fields?
[Diagram: Longformer* encodes a full paragraph at once ([CLS] s1 [SEP] s2 [SEP] …), pooling each sentence's tokens and classifying them jointly]

Method           P      R      F1
SciBERT          57.03  68.08  62.06
Longformer-Solo  57.21  68.00  62.14
Longformer-Ctx   59.92  77.15  67.45  (Δ 5 pts)

* Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The Long-Document Transformer. CoRR, abs/2004.05150.
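A sketch of the Longformer-Ctx idea under stated assumptions: all sentences of a paragraph are encoded in one pass, separated by the tokenizer's separator token, and each sentence's token vectors are mean-pooled and classified. The pooling and classifier details are illustrative, not necessarily the exact paper setup:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
encoder = AutoModel.from_pretrained("allenai/longformer-base-4096")
classifier = torch.nn.Linear(encoder.config.hidden_size, 2)  # cite-worthy or not

paragraph = ["We build on transformer language models.",
             "We use the model from the original BERT paper.",
             "Results are reported in Section 5."]
text = tok.sep_token.join(paragraph)            # one sequence, sentences split by </s>
batch = tok(text, return_tensors="pt")
hidden = encoder(**batch).last_hidden_state[0]  # (seq_len, hidden_size)

# Recover sentence boundaries from separator positions, pool, and classify.
sep_positions = (batch["input_ids"][0] == tok.sep_token_id).nonzero().flatten().tolist()
start, logits = 1, []                           # position 0 is the <s> token
for end in sep_positions:
    logits.append(classifier(hidden[start:end].mean(dim=0)))  # mean-pool the sentence
    start = end + 1
logits = torch.stack(logits)                    # one prediction per sentence, with context
```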
24. Transfer Learning
RQ4: Can large scale cite-worthiness data be used to perform transfer learning to downstream scientific text tasks?
• Pretrain a model and fine-tune on 10 tasks (NER, relation extraction, text classification)
• Base: Original SciBERT model fine-tuned on downstream tasks
• LM: SciBERT with MLM fine-tuning on CiteWorth
• Cite: SciBERT fine-tuned on cite-worthiness detection
• LMCite: SciBERT with MLM fine-tuning on CiteWorth + fine-tuned on cite-worthiness
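The LMCite recipe can be sketched as two stages, shown below with illustrative data and a single gradient step per stage; the local checkpoint directory name is hypothetical:

```python
from transformers import (AutoModelForMaskedLM, AutoModelForSequenceClassification,
                          AutoTokenizer, DataCollatorForLanguageModeling)

tok = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")

# Stage 1: masked-language-model fine-tuning on CiteWorth text (one toy step).
mlm = AutoModelForMaskedLM.from_pretrained("allenai/scibert_scivocab_uncased")
collator = DataCollatorForLanguageModeling(tok, mlm_probability=0.15)
enc = tok("We use the model from the original BERT paper.", truncation=True)
batch = collator([{"input_ids": enc["input_ids"]}])  # randomly masks tokens
mlm(**batch).loss.backward()                         # an optimiser step would follow
mlm.save_pretrained("scibert-citeworth-mlm")         # hypothetical local path
tok.save_pretrained("scibert-citeworth-mlm")

# Stage 2: fine-tune the adapted encoder on cite-worthiness labels
# (as in the earlier single-sentence sketch), giving the "LMCite" configuration.
clf = AutoModelForSequenceClassification.from_pretrained(
    "scibert-citeworth-mlm", num_labels=2)
```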
26. Conclusions
• We introduce CiteWorth – a large, rigorously cleaned dataset for citation-related tasks
• We show that paragraph-level context is crucial to perform cite-worthiness detection
• We show that the data is diverse, with a significant domain effect
• We show that cite-worthiness is a highly transferable task for scientific text
27. Open Questions
• How to improve domain adaptation for scientific text?
• What other useful features are there?
• Author network
• Document-level context
• Other types of structure (case study: discourse structure)
• Other tasks using this data, e.g. citation recommendation
28. Overview of Today’s Talk
• Introduction
• The Life Cycle of Scientific Research
• Part 1: Cite-Worthiness Detection
• The CiteWorth dataset
• Methods for cite-worthiness detection
• Part 2: Exaggeration Detection
• Task framing
• Semi-supervised learning for exaggeration detection
• Conclusion
• Future research challenges
31. Problem
https://www.sciencedaily.com/releases/2021/05/210525101658.htm
Yijun Bao, Somayyeh Soltanian-Zadeh, Sina Farsiu, Yiyang Gong. Segmentation of neurons from fluorescence calcium recordings beyond real time. Nature Machine Intelligence, 2021; DOI: 10.1038/s42256-021-00342-x
The abstract makes a conditionally causal claim (“potentially enabling”), while the press release makes a direct causal claim.
32. Our Contributions
• Formalisation of the problem of scientific exaggeration detection
• Curation of a benchmark dataset for scientific exaggeration detection
• Semi-supervised method based on Pattern Exploiting Training (PET) to address the task
33. Prior Work on Understanding Exaggeration in Science
• Manual attempts
• Sumner et al. 2014 and Bratton et al. 2019: InSciOut
• Manually label 823 pairs of press releases and abstracts
• Labels: causal claim strength of conclusions, advice given, independent and dependent variables, etc.
• Find that about 33% of press releases contain exaggerated conclusions
• Major problem: press releases are the “dominant link between academia and the media”
• Automatic attempts
• Li et al. 2017, Yu et al. 2019, Yu et al. 2020
• Predict causal claim strength of conclusion sentences in abstract and press release
• No clean paired data for evaluation
34. Our Work on Exaggeration Detection in Science
• The focus of this work is predicting when a press release exaggerates a scientific paper
• We focus on predicting this using the primary finding of the paper as written in the abstract and the press release
• We build on previous work which focuses on causal claim strength prediction of these primary findings
36. Formal Problem Definition
Dataset D = {(t_i, s_i, y_i) | i ∈ [0 … N)}
• Source documents s_i
• Target documents t_i written about s_i
• Labels y_i ∈ {0: Downplays, 1: Same, 2: Exaggerates}, indicating whether t_i exaggerates, downplays, or faithfully represents s_i
Learning goal: predict y given s and t
37. Task Formulations
• T1
• Entailment-like task to predict the exaggeration label
• Paired (press release, abstract) data
• Label space L_T1 = {0: Downplays, 1: Same, 2: Exaggerates}
• T2
• Text classification task to predict causal claim strength
• Unpaired press releases and abstracts
• Final prediction compares the strength of a paired press release and abstract
• Label space L_T2 = {0: No Relation, 1: Correlational, 2: Conditional Causal, 3: Direct Causal}

Language cues per claim-strength label (Li et al. 2017):
Label  Type                Language Cue
0      No Relation
1      Correlational       association, associated with, predictor, at high risk of
2      Conditional causal  increase, decrease, lead to, effect on, contribute to, result in (cues indicating doubt: may, might, appear to, probably)
3      Direct causal       increase, decrease, lead to, effective on, contribute to, reduce, can
38. Evaluation Dataset Creation
• Start with the 823 labeled pairs from Sumner et al. 2014 and Bratton et al. 2019 (InSciOut)
• Collect original abstract text from Semantic Scholar
• Match original conclusion sentences to paraphrased annotations via ROUGE score
• Manually inspect and discard missing or incorrect abstracts
• Final label: compare annotated claim strength (s_p for the press release, s_a for the abstract):
  Downplays if s_p < s_a; Same if s_p = s_a; Exaggerates if s_p > s_a
• Total data: 663 pairs (100 training, 553 test)
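The final labelling rule reduces to a three-way comparison; a small sketch (function and variable names are illustrative):

```python
STRENGTH = {0: "No Relation", 1: "Correlational", 2: "Conditional Causal", 3: "Direct Causal"}

def exaggeration_label(press_strength: int, abstract_strength: int) -> str:
    """Compare paired claim strengths (s_p vs s_a) to obtain the T1 label."""
    if press_strength > abstract_strength:
        return "Exaggerates"
    if press_strength < abstract_strength:
        return "Downplays"
    return "Same"

# The slide 31 example: a conditionally causal abstract ("potentially enabling")
# reported with a direct causal claim in the press release.
print(exaggeration_label(press_strength=3, abstract_strength=2))  # Exaggerates
```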
40. PET (Schick et al. 2020)
[Diagram: a traditional classifier maps “Eating chocolate causes happiness” directly to a distribution over the labels 0–3; PET instead feeds “Eating chocolate causes happiness. The claim strength is [MASK]” to a large pretrained language model and reads off probabilities for label words such as medium, estimated, cautious, distorted]
• Pattern: transform the input to a cloze-style question
• Verbalizer: predict tokens from the language model which reflect the data’s labels
• Multiple pattern-verbalizer pairs produce soft labels on unlabelled data, which train a final classifier with a KL-divergence loss
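A minimal sketch of the PET idea with HuggingFace transformers, assuming roberta-base as the masked language model; the pattern text and verbalizer words here are illustrative assumptions, not the paper's exact choices:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("roberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Verbalizer: one label word per claim-strength class (first sub-token is used).
VERBALIZER = {label: tok(" " + word, add_special_tokens=False)["input_ids"][0]
              for label, word in enumerate(["weak", "medium", "cautious", "strong"])}

def pet_scores(sentence):
    """Pattern: turn the input into a cloze question; score labels via the MLM."""
    text = f"{sentence} The claim strength is {tok.mask_token}."
    batch = tok(text, return_tensors="pt")
    mask_pos = (batch["input_ids"][0] == tok.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = mlm(**batch).logits[0, mask_pos]
    return torch.stack([logits[VERBALIZER[l]] for l in range(4)]).softmax(-1)

print(pet_scores("Eating chocolate causes happiness."))
# In full PET, such soft labels supervise a classifier on unlabelled data
# (the slide's KL-divergence distillation step).
```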
43. T1 (Exaggeration Detection) with MT-PET

Method      P      R      F1
Supervised  28.06  33.10  29.05
PET         41.90  39.87  39.12
MT-PET      47.80  47.99  47.35

• Substantial improvements when using PET (10 points)
• Further improvements with MT-PET (8 points)
• Demonstrates transfer of knowledge from claim strength prediction to exaggeration prediction
44. T2 (Claim Strength Prediction) with MT-PET

200 samples from T2, 100 samples from T1:
Method      P      R      F1
Supervised  49.28  51.07  49.03
PET         55.76  58.58  56.57
MT-PET      56.68  60.13  57.44

4500 samples from T2, 100 samples from T1:
Method      P      R      F1
Supervised  58.20  59.99  58.66
PET         58.53  61.84  60.45
MT-PET      60.09  61.11  —

• MT-PET outperforms PET in both scenarios
45. T2 (Claim Strength Prediction) with MT-PET
(Same results as the previous slide, for 200 and 4,500 T2 samples respectively.)
• MT-PET with 200 samples approaches supervised performance with 4,500 samples
46. Error Analysis
• All models:
• disproportionately get pairs involving direct causal claims incorrect
• do best for correlational claims from abstracts, and claims from press releases which are correlational or stronger
• MT-PET:
• helps the most for the most difficult category – causal claims
47. Summary
• We formalize the problem of scientific exaggeration detection, providing two task formulations for the problem
• We curate a set of benchmark data to evaluate automatic methods for performing the task
• We propose MT-PET, a few-shot learning method based on PET, which we demonstrate outperforms strong baselines
48. Overview of Today’s Talk
• Introduction
• The Life Cycle of Scientific Research
• Part 1: Cite-Worthiness Detection
• The CiteWorth dataset
• Methods for cite-worthiness detection
• Part 2: Exaggeration Detection
• Task framing
• Semi-supervised learning for exaggeration detection
• Conclusion
• Future research challenges
50. Supporting the Life Cycle of Research
[Diagram, as on slide 3: the research life cycle (Information Discovery, Conducting Experiments, Paper Writing, Peer Review, Research Impact Tracking) with supporting NLP tasks: Information Extraction, Summarisation, Citation Prediction, Writing Assistance, Reviewing Support, Reviewer Matching, Review Score Prediction, Citation Analysis, Citation Trend Analysis]
51. Supporting the Life Cycle of Research
[Diagram: the same life-cycle figure, now with Credibility Detection added as a NEW supporting task]
52. Overall Take-Aways
• Why scholarly document processing?
• Supporting the life cycle of research, from information discovery to research impact tracking
• Why credibility detection for scholarly communication?
• Detect claims which should be backed up by evidence (cite-worthiness detection)
• Detect inconsistencies between primary and secondary sources of information (exaggeration detection)
53. Overall Take-Aways
• Overarching challenges
• Difficult NLP tasks (require understanding of pragmatics)
• Domain effects and the importance of context pose further challenges
• Not well-studied yet
• Scarcity of available benchmarks
• Many opportunities for future work
• Explore more diverse settings
• Gather more datasets
• Methods for domain adaptation & few-shot learning
• Tools for journalists & authors