@eLifeInnovation
Citation Sentiment
Workshop on Open Citations (Sep 2018)
Work by David Ciudad, presented by Daniel Ecer
Data Scientist
@eLifeInnovation
Outline
About eLife
Project Overview
Athar Dataset
Inferring Sentiment from Retractions
Summary
Next
elifesciences.org 2
@eLifeInnovation
About eLife
@eLife @eLifeInnovation
3
@eLifeInnovation
-- Sir Mark Walport, UK Science Advisor
Founding member, eLife Board of Directors
5
eLife is a non-profit organisation inspired by research funders and led by scientists
Title
Illustration by Davide Bonazzi
Get involved elifesci.org/innovate
#eLifeSprint. Credit: Orquidea Real Photobook - Julieta Sarmiento Photography @orquidea.real.pho
Funding, exposure and mentorship
Supporting the open source community through the eLife Innovation Initiative
@eLifeInnovation
Project Overview
7
@eLifeInnovation
Work by David Ciudad
● ASI Fellowship project:
Investigating the context of citations
● Project continuation:
Investigate criticism / sentiment of citations
(preliminary results)
8
@eLifeInnovation
Example citation network visualisation
9
http://citationgecko.com/
@eLifeInnovation
Example citation
“Our results contrast with the high rate of XMRV detection
reported by Lombard et al. among both CFS patients and
controls, but are in agreement with recent data reported in
two large studies in the UK and in the Netherlands...”
(Switzer et al., 2010)
10
@eLifeInnovation
Why does the sentiment matter?
11
Pos
Pos
Neg
3
Citation count
“impact”
“important”
Negative citations count towards
citation count
?
@eLifeInnovation
What do we mean by sentiment?
● Polarity:
○ Positive
○ Neutral
○ Negative (criticism)
12
@eLifeInnovation
Off-the-shelf sentiment analysis...
13
Textblob+0.16 +0.07
“Our results contrast …” “but are in agreement …”
NLTK
Vader
0.0 +0.18
Expected< 0 > 0
https://github.com/elifesciences/citation-sentiment-analysis/blob/develop/notebooks/off-the-shelf-sentiment-analysis.ipynb
@eLifeInnovation
Athar Dataset
14
@eLifeInnovation
Athar Dataset
● “Sentiment Analysis of Citations using Sentence
Structure-Based Features” (June 2011, Awais Athar)
● http://cl.awaisathar.com/citation-sentiment-corpus/
● 8736 manually annotated citations
15
https://github.com/elifesciences/citation-sentiment-analysis/blob/develop/notebooks/athar-dataset.ipynb
@eLifeInnovation
Athar dataset cleaning
● There are 37 records with a citation text > 1000
characters (text shows missing sentence separator)
● Some citation texts are missing word separations or
include single character tokens
● Filter tokens, keep only:
○ known English words
○ that are at least three characters
○ except the word "no"
(we don’t remove most nltk stop words as they might have an
influence on sentiment, e.g. “won't”)
16
@eLifeInnovation
Preliminary Results - predict negative vs neutral
● Naive Bayes (bag of words):
○ ~0.77 (0.84 for more biased subset)
○ Words are stemmed
17
Negative Neutral
outperform bilingu
while similar
no updat
although constitu
limit posit
small follow
show describ
Table: Most indicative word (stems)
@eLifeInnovation
Athar Dataset Conclusion
● Dataset too small (only 280 negative citations)
● “Messy” data (could be cleaned up via source ACL
Anthology Network corpus?)
18
@eLifeInnovation
Inferring Sentiment from
Retractions
19
@eLifeInnovation
Retraction Watch Database
20
http://retractiondatabase.org/
@eLifeInnovation
Retraction Note example
21
https://hccpjournal.biomedcentral.com/articles/10.1186/s13053-018-0093-1
@eLifeInnovation
Retraction Note references
22
https://hccpjournal.biomedcentral.com/articles/10.1186/s13053-018-0093-1
Retracted article
Criticising article
@eLifeInnovation
Can we harvest criticising citations
from retraction notes?
23
@eLifeInnovation
Retraction note - not linking to criticising papers
24
https://arthritis-research.biomedcentral.com/articles/10.1186/s13075-015-0847-3
@eLifeInnovation
Retraction note - not including any references
25
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4225460/
@eLifeInnovation
Retraction Notes in PMC
26
499
Retraction
Watch
Database
error(s) in data,
results and/or
conclusions
60in PMC
@eLifeInnovation
Retraction Notes in PMC with Criticism
27
499
Retraction
Watch
Database
error(s) in data,
results and/or
conclusions
60in PMC
18
with
references 9
criticising
references
@eLifeInnovation
Retraction Notes (reality)
● Most retraction notes do not link to criticising papers
● Retraction Notes not always tagged accurately in PMC
● Retraction reason may not be machine readable
28
@eLifeInnovation
Can we still infer sentiment from retractions?
29
@eLifeInnovation
Hypothesis: Retraction ~ Negative sentiment
● 2009: Paper B1 published
● 2010: Neutral citation: “These deficiencies in NK
activity may increase viral load in CFS, incidentally a
recent study observed increases in xenotropic murine
leukemia virus-related virus (XMRV) in peripheral blood
samples of CFS patients [B1]”
● 2011: Paper B1 retracted
● 2016: Criticising citation: “The recent link between
ME/CFS and xenotropic murine leukemia virus-related
virus (XMRV) [B1] has largely been disproven and
attributed to laboratory contaminants [B2]”
30
@eLifeInnovation
Are citations after retraction more likely negative?
31
Pos Pos Neg
Cited Paper
Citations
time
Retraction
Before After
Pos NegNeg
@eLifeInnovation
Predict: before / after retraction
32
Citation
Text 1
Learn Before
Citation
Text 2
Learn After
Citation
Text 3
Learn Before
Citation
Text 4
Predict ?
@eLifeInnovation
Retraction Notes in PMC (reminder)
33
499
Retraction
Watch
Database
error(s) in data,
results and/or
conclusions
60in PMC
@eLifeInnovation
Retraction Notes in PMC with known retraction date
53
with known
retraction
date
142citing papers
34
499
Retraction
Watch
Database
error(s) in data,
results and/or
conclusions
60in PMC
@eLifeInnovation
Citing papers before/after date of retraction
142
104 38
published
after
published
before
35
citing papers
@eLifeInnovation
Citation Text - Citing Sentence
“Conversely, a recent article asserted that the variant was
associated with breast cancer, based on a relatively limited
case control association study in the Norwegian population
(Møller & Hovig, 2017).”
36
citingsentence
@eLifeInnovation
Citation Text - Citing Sentence + 3 (4 sentences)
“Conversely, a recent article asserted that the variant was associated with
breast cancer, based on a relatively limited case control association study in
the Norwegian population (Møller & Hovig, 2017).
As a consequence, to date the classification of c.68‐7T > A reported in
databases aggregating information on genomic variations has remained
inconclusive.
In particular, ClinVar reports conflicting interpretations classifying the variant as
benign (seven entries), likely benign (nine entries) and of uncertain significance
(four entries).
Moreover, the BIC database presently annotates the variant as of unknown
clinical importance, pending classification, while the BRCA Share™ classifies it
as likely benign.”
37
citing
sentence+1+2+3
https://dl.acm.org/citation.cfm?id=2382125 (2012, Context-enhanced citation sentiment detection)
@eLifeInnovation
Different models with citing sentences + context
Citing
Sentence
Model
Citing
Sentence
+3 (4)
Model
Citing
Sentence
+5 (6)
Model
@eLifeInnovation
Balance dataset (control for subject area etc.)
142
citing
104
before
38
after
53
retracted
54
citing
27
before
27
after
12
retracted
balance
@eLifeInnovation
Detailed results*
40
Undersampled
Undersampled +
reduced bias
Citing Sentence 0.846 0.750
Citing Sentence +3 (4) 0.538 0.833
Citing Sentence +5 (6) 0.615 0.583
On closer inspection, used words like “cancer”
(indication of biased dataset)
More useful
*preliminary results
@eLifeInnovation
Top 7 stemmed words for 3-sentence model*
41
After Before
remain model
present subject
assess function
fail lack
conflict discoveri
induct research
individu no
Table: Most indicative word (stems)
*preliminary results
@eLifeInnovation
Summary
42
@eLifeInnovation
Summary
Retraction
Watch
Database
Athar’s
dataset
Sentiment
Model
Before/After
Model
Retraction
Note
@eLifeInnovation
Summary (combined model)
Retraction
Watch
Database
Athar’s
dataset
Sentiment
Model
Before/After
Model
Retraction
Note
Combined
Model
@eLifeInnovation
Next
45
@eLifeInnovation
Next
● Getting more training/test data:
○ Pay for manual annotation?
○ Other ideas to automatically infer sentiment?
● Use cases for reliable sentiment detection of a citation:
○ Better metric? (Colin will talk more about metrics)
○ Identify contentious work?
○ <your suggestion here>
● More suggestions, feedback, collaboration please
46
@eLifeInnovation
Links
● https://github.com/dciudadr/eLife_retractions
● https://github.com/elifesciences/citation-sentiment-analysis
● Blog: Investigating the context of citations
47
@eLifeInnovation
Fin
48
Background: CC0, https://www.pexels.com/photo/night-television-tv-video-8158/

Citation Sentiment