
Citation Sentiment


Presentation of an analysis of sentiment, or criticism, in scientific citations, carried out by David Ciudad for eLife Sciences Publications Ltd. (https://elifesciences.org/).
The work was presented at the Workshop on Open Citations in September 2018 (https://workshop-oc.github.io/).


  1. Citation Sentiment. Workshop on Open Citations (Sep 2018). Work by David Ciudad, presented by Daniel Ecer, Data Scientist. @eLifeInnovation
  2. Outline: About eLife; Project Overview; Athar Dataset; Inferring Sentiment from Retractions; Summary; Next. (elifesciences.org)
  3. About eLife (@eLife, @eLifeInnovation)
  4. [Quote slide] -- Sir Mark Walport, UK Science Advisor; founding member, eLife Board of Directors
  5. eLife is a non-profit organisation inspired by research funders and led by scientists.
  6. Supporting the open source community through the eLife Innovation Initiative: funding, exposure and mentorship. Get involved: elifesci.org/innovate #eLifeSprint. (Illustration by Davide Bonazzi. Photo credit: Orquidea Real Photobook - Julieta Sarmiento Photography, @orquidea.real.pho)
  7. Project Overview
  8. Work by David Ciudad: ● ASI Fellowship project: investigating the context of citations ● Project continuation: investigating the criticism / sentiment of citations (preliminary results)
  9. Example citation network visualisation: http://citationgecko.com/
  10. Example citation: “Our results contrast with the high rate of XMRV detection reported by Lombardi et al. among both CFS patients and controls, but are in agreement with recent data reported in two large studies in the UK and in the Netherlands...” (Switzer et al., 2010)
  11. Why does the sentiment matter? [Diagram: a paper cited twice positively and once negatively still has a citation count of 3, which is read as “impact” or “importance”.] Should negative citations count towards the citation count?
  12. What do we mean by sentiment? ● Polarity: ○ Positive ○ Neutral ○ Negative (criticism)
  13. Off-the-shelf sentiment analysis... For “Our results contrast …” the expected polarity is < 0, yet TextBlob scores it +0.16 and NLTK VADER 0.0; for “but are in agreement …” the expected polarity is > 0, with TextBlob at +0.07 and NLTK VADER at +0.18. https://github.com/elifesciences/citation-sentiment-analysis/blob/develop/notebooks/off-the-shelf-sentiment-analysis.ipynb
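     A minimal sketch reproducing the comparison above, assuming the textblob and nltk packages are installed and using shortened stand-ins for the two citation fragments:

        # Compare off-the-shelf sentiment scores on two citation fragments.
        # Requires: pip install textblob nltk, then nltk.download('vader_lexicon').
        from textblob import TextBlob
        from nltk.sentiment.vader import SentimentIntensityAnalyzer

        fragments = [
            ('expected < 0', 'Our results contrast with the high rate of XMRV detection reported previously.'),
            ('expected > 0', 'Our results are in agreement with recent data reported in two large studies.'),
        ]

        vader = SentimentIntensityAnalyzer()
        for expected, text in fragments:
            textblob_polarity = TextBlob(text).sentiment.polarity      # range -1..+1
            vader_compound = vader.polarity_scores(text)['compound']   # range -1..+1
            print(f'{expected}: TextBlob={textblob_polarity:+.2f}, VADER={vader_compound:+.2f}')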
  14. Athar Dataset
  15. Athar Dataset: ● “Sentiment Analysis of Citations using Sentence Structure-Based Features” (June 2011, Awais Athar) ● http://cl.awaisathar.com/citation-sentiment-corpus/ ● 8,736 manually annotated citations. https://github.com/elifesciences/citation-sentiment-analysis/blob/develop/notebooks/athar-dataset.ipynb
  16. Athar dataset cleaning: ● 37 records have a citation text longer than 1,000 characters (the text is missing sentence separators) ● some citation texts are missing word separators or include single-character tokens ● filter tokens, keeping only: ○ known English words ○ of at least three characters ○ plus the word “no” as an exception (most NLTK stop words are kept, as they might influence sentiment, e.g. “won't”). A sketch of this filter follows.
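     A minimal sketch of that token filter, assuming NLTK's English word list stands in for the project's vocabulary of known English words (the slide does not name the exact word list used):

        # Keep only known English words of >= 3 characters, plus the word "no".
        # Requires: pip install nltk, then nltk.download('words').
        from nltk.corpus import words

        ENGLISH_WORDS = set(word.lower() for word in words.words())

        def filter_tokens(tokens):
            return [
                token for token in tokens
                if token.lower() in ENGLISH_WORDS
                and (len(token) >= 3 or token.lower() == 'no')
            ]

        print(filter_tokens(['no', 'xq', 'agreement', 'contrast', '2010']))
        # expected: ['no', 'agreement', 'contrast']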
  17. Preliminary results - predict negative vs neutral: ● Naive Bayes (bag of words): ~0.77 (0.84 for a more biased subset) ● words are stemmed. Table, most indicative word stems: Negative: outperform, while, no, although, limit, small, show; Neutral: bilingu, similar, updat, constitu, posit, follow, describ. A sketch of this baseline follows.
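     A minimal sketch of the negative-vs-neutral baseline: a bag-of-words Naive Bayes over Porter-stemmed tokens. Variable names and the train/test split are illustrative; the linked notebooks hold the real setup:

        # Bag-of-words Naive Bayes with Porter stemming (scikit-learn + NLTK).
        from nltk.stem import PorterStemmer
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.pipeline import make_pipeline

        stemmer = PorterStemmer()
        base_analyzer = CountVectorizer().build_analyzer()

        def stemmed_analyzer(text):
            # Tokenize with the default analyzer, then stem each token.
            return [stemmer.stem(token) for token in base_analyzer(text)]

        model = make_pipeline(
            CountVectorizer(analyzer=stemmed_analyzer),
            MultinomialNB(),
        )
        # texts: citation strings; labels: 'negative' / 'neutral'
        # model.fit(train_texts, train_labels)
        # model.score(test_texts, test_labels)  # ~0.77 reported on the slide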
  18. Athar Dataset conclusion: ● dataset too small (only 280 negative citations) ● “messy” data (could be cleaned up via the source ACL Anthology Network corpus?)
  19. Inferring Sentiment from Retractions
  20. Retraction Watch Database: http://retractiondatabase.org/
  21. Retraction note example: https://hccpjournal.biomedcentral.com/articles/10.1186/s13053-018-0093-1
  22. Retraction note references: the note's reference list links both the retracted article and the criticising article. https://hccpjournal.biomedcentral.com/articles/10.1186/s13053-018-0093-1
  23. Can we harvest criticising citations from retraction notes?
  24. Retraction note - not linking to criticising papers: https://arthritis-research.biomedcentral.com/articles/10.1186/s13075-015-0847-3
  25. Retraction note - not including any references: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4225460/
  26. Retraction notes in PMC: the Retraction Watch Database lists 499 retraction notes with reason “error(s) in data, results and/or conclusions”; 60 of these are in PMC.
  27. Retraction notes in PMC with criticism: of those 60 retraction notes in PMC, 18 include references and 9 contain criticising references.
  28. Retraction notes (reality): ● most retraction notes do not link to criticising papers ● retraction notes are not always tagged accurately in PMC ● the retraction reason may not be machine readable
  29. Can we still infer sentiment from retractions?
  30. Hypothesis: retraction ~ negative sentiment. ● 2009: paper B1 published ● 2010: neutral citation: “These deficiencies in NK activity may increase viral load in CFS, incidentally a recent study observed increases in xenotropic murine leukemia virus-related virus (XMRV) in peripheral blood samples of CFS patients [B1]” ● 2011: paper B1 retracted ● 2016: criticising citation: “The recent link between ME/CFS and xenotropic murine leukemia virus-related virus (XMRV) [B1] has largely been disproven and attributed to laboratory contaminants [B2]”
  31. Are citations after retraction more likely to be negative? [Timeline diagram: citations of a paper before its retraction are mostly positive; after the retraction, mostly negative.]
  32. Predict: before / after retraction. [Diagram: citation texts 1-3 carry learning labels (before, after, before); citation text 4 is held out for prediction.]
  33. Retraction notes in PMC (reminder): 499 in the Retraction Watch Database with reason “error(s) in data, results and/or conclusions”, 60 of these in PMC.
  34. Retraction notes in PMC with a known retraction date: of the 60 in PMC, 53 have a known retraction date, giving 142 citing papers.
  35. Citing papers before/after the date of retraction: of the 142 citing papers, 104 were published before and 38 after the retraction (labelling sketched below).
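     A minimal sketch of that labelling step, assuming a pandas DataFrame with one row per citing paper (column names are illustrative; the dates echo the B1 timeline on slide 30):

        # Label each citing paper as published before or after the retraction.
        import pandas as pd

        citing = pd.DataFrame({
            'citing_pub_date': pd.to_datetime(['2010-05-01', '2016-03-15']),
            'retraction_date': pd.to_datetime(['2011-12-01', '2011-12-01']),
        })
        citing['label'] = (
            citing['citing_pub_date'] > citing['retraction_date']
        ).map({True: 'after', False: 'before'})
        print(citing['label'].tolist())  # ['before', 'after']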
  36. Citation text - citing sentence: “Conversely, a recent article asserted that the variant was associated with breast cancer, based on a relatively limited case control association study in the Norwegian population (Møller & Hovig, 2017).”
  37. Citation text - citing sentence + 3 (4 sentences): “Conversely, a recent article asserted that the variant was associated with breast cancer, based on a relatively limited case control association study in the Norwegian population (Møller & Hovig, 2017). As a consequence, to date the classification of c.68‐7T > A reported in databases aggregating information on genomic variations has remained inconclusive. In particular, ClinVar reports conflicting interpretations classifying the variant as benign (seven entries), likely benign (nine entries) and of uncertain significance (four entries). Moreover, the BIC database presently annotates the variant as of unknown clinical importance, pending classification, while the BRCA Share™ classifies it as likely benign.” Context windows follow “Context-enhanced citation sentiment detection” (2012): https://dl.acm.org/citation.cfm?id=2382125
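     A sketch of building such a context window, assuming the position of the citing sentence within its paragraph is already known, using NLTK's sentence tokenizer (needs nltk.download('punkt')):

        # Return the citing sentence plus the next N sentences as one text.
        from nltk.tokenize import sent_tokenize

        def citing_context(paragraph, citing_sentence_index, extra_sentences=3):
            sentences = sent_tokenize(paragraph)
            end = citing_sentence_index + 1 + extra_sentences
            return ' '.join(sentences[citing_sentence_index:end])

        # citing_context(paragraph, 0, extra_sentences=3) -> the 4-sentence window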
  38. Different models with citing sentence + context: ● citing sentence model ● citing sentence +3 (4 sentences) model ● citing sentence +5 (6 sentences) model
  39. Balance the dataset (control for subject area etc.): 142 citing papers (104 before / 38 after, from 53 retracted papers) are balanced down to 54 citing papers (27 before / 27 after, from 12 retracted papers); see the undersampling sketch below.
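     A minimal sketch of the label-balancing step via undersampling. The slide's balancing additionally controls for subject area, which would add a second grouping key; this sketch only equalises the before/after labels:

        # Undersample the majority class so 'before' and 'after' are equal.
        import pandas as pd

        def undersample(df, label_column='label', seed=42):
            n = df[label_column].value_counts().min()
            return (
                df.groupby(label_column, group_keys=False)
                  .sample(n=n, random_state=seed)
            )

        # e.g. 104 'before' / 38 'after' -> 38 / 38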
  40. Detailed results (*preliminary results):
      Model                      Undersampled   Undersampled + reduced bias
      Citing Sentence            0.846          0.750
      Citing Sentence +3 (4)     0.538          0.833
      Citing Sentence +5 (6)     0.615          0.583
      On closer inspection, the plain undersampled models relied on words like “cancer” (an indication of a biased dataset); the reduced-bias results are more useful.
  41. Top 7 stemmed words for the 3-sentence model (*preliminary results). Table, most indicative word stems:
      After:  remain, present, assess, fail, conflict, induct, individu
      Before: model, subject, function, lack, discoveri, research, no
  42. Summary
  43. Summary. [Diagram: Athar's dataset feeds a sentiment model; the Retraction Watch Database and retraction notes feed a before/after model.]
  44. Summary (combined model). [Diagram: the sentiment model and the before/after model feed into a combined model.]
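     The slides do not say how the combined model joins the two signals; one hypothetical sketch, assuming both models are scikit-learn-style classifiers exposing predict_proba (the class column indices are illustrative, not from the talk):

        # Average the two models' probabilities that a citation is critical.
        def combined_criticism_score(sentiment_model, before_after_model, texts):
            p_negative = sentiment_model.predict_proba(texts)[:, 0]   # assume column 0 = negative
            p_after = before_after_model.predict_proba(texts)[:, 0]   # assume column 0 = after
            return (p_negative + p_after) / 2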
  45. Next
  46. Next: ● getting more training/test data: ○ pay for manual annotation? ○ other ideas to automatically infer sentiment? ● use cases for reliable sentiment detection of a citation: ○ a better metric? (Colin will talk more about metrics) ○ identify contentious work? ○ <your suggestion here> ● more suggestions, feedback and collaboration, please
  47. Links: ● https://github.com/dciudadr/eLife_retractions ● https://github.com/elifesciences/citation-sentiment-analysis ● blog: Investigating the context of citations
  48. Fin. Background: CC0, https://www.pexels.com/photo/night-television-tv-video-8158/
