
Amanuens.is: Humans and machines annotating scholarly literature

About 10,000 scholarly articles ("papers") are published each day. Amanuens.is is a symbiont of ContentMine and Hypothes.is (both Shuttleworth projects/Fellows) which annotates these using an array of controlled vocabularies ("dictionaries"). The results, in semantic form, are used to annotate the original material. The talk had live demos and used plant chemistry for its examples.

Published in: Science

  1. 1. Amanuens.is ContentMine IAnnotate!, Berlin, DE, 2016-05-18 Peter Murray-Rust [1] University of Cambridge [2] TheContentMine ContentMine + Hypothesis annotate the scientific literature! 100,000+ per day. Live demos!
  2. 2. Scholarly publishing • 10,000 articles per day • 20 billion USD / year [1] • Totally and scandalously broken. Primary revenue comes from throttling the flow of knowledge • Massive disruption likely (Sci-Hub) • Mining and annotation liberation tools. [1] (2x digital music industry!)
  3. 3. (2x digital music industry!)
  4. 4. • Science can be read and understood by human-machine Amanuensis-symbionts. • Amanuenses based on Wikipedia, Wikidata, software (ContentMine’s AMI) • Results are fed back into WP and WikiData • Annotation through Hypothes.is http://en.wikipedia.org/wiki/Symbiosis http://en.wikipedia.org/wiki/Eric_Fenby
  5. 5. What plants produce carvone? https://en.wikipedia.org/wiki/Carvone
  6. 6. https://en.wikipedia.org/wiki/Carvone WIKIDATA
  7. 7. Carvone in Wikidata
  8. 8. Search for carvone
  9. 9. [Diagram: ContentMine pipeline] Sources: EuPMC, arXiv, CORE, HAL, (UNIV repos), ToC services, publisher sites (PloSONE, BMC, peerJ…; Nature, IEEE, Elsevier…) • Input: getpapers (catalogue, query), quickscrape (daily crawl, URLs, DOIs, scrapers) • Formats: PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS • norma: normalizer, structurer, semantic tagger (text, data, figures), producing Semantic ScholarlyHTML (abstract, methods, references, captioned figures, HTML tables) • ami: search, lookup, content mining with community plugins (Chem, Phylo, Trials, Crystal, Plants; queries, taggers) • Output: facts, visualization and analysis • 30,000 pages/day • Latest 20150908
  10. 10. Mining for phytochemicals • getpapers -q carvone -o carvone -x -k 100 (search “carvone”, output to carvone/, format XML, limit 100 hits) • cmine carvone (normalize papers; search locally for species, sequences, diseases, drugs; results in dataTables.html and results/…/results.xml, which includes W3C annotation) • python cmhypy.py carvone/ -u petermr <key> (send annotations -> hypothes.is)
  11. 11. Annotation (entity in context) prefix surface label location suffix
  12. 12. articles facets gene disease drug Phyto chem species genus words
  13. 13. Remote & Local papers Disease ICD-10 phytochemicals species Commonest words
  14. 14. Mining for phytochemicals • getpapers -q carvone -o carvone -x -k 100 (search “carvone”, output to carvone/, format XML, limit 100 hits) • cmine carvone (normalize papers; search locally for species, sequences, diseases, drugs; results in dataTables.html and results/…/results.xml, which includes W3C annotation) • python cmhypy.py carvone/ -u petermr <key> (send annotations -> hypothes.is)
  15. 15. Annotation (entity in context) prefix surface label location suffix
  16. 16. Annotation sent to hypothes.is prefix suffix source user text uri maybe 100+ annotations per paper text
  17. 17. @Senficon (Julia Reda): Text & Data mining in times of #copyright maximalism: "Elsevier stopped me doing my research" http://onsnetwork.org/chartgerink/2015/11/16/elsevier-stopped-me-doing-my-research/ … #opencon #TDM Elsevier stopped me doing my research Chris Hartgerink
  18. 18. I am a statistician interested in detecting potentially problematic research such as data fabrication, which results in unreliable findings and can harm policy-making, confound funding decisions, and hampers research progress. To this end, I am content mining results reported in the psychology literature. Content mining the literature is a valuable avenue of investigating research questions with innovative methods. For example, our research group has written an automated program to mine research papers for errors in the reported results and found that 1/8 papers (of 30,000) contains at least one result that could directly influence the substantive conclusion [1]. In new research, I am trying to extract test results, figures, tables, and other information reported in papers throughout the majority of the psychology literature. As such, I need the research papers published in psychology that I can mine for these data. To this end, I started ‘bulk’ downloading research papers from, for instance, Sciencedirect. I was doing this for scholarly purposes and took into account potential server load by limiting the amount of papers I downloaded per minute to 9. I had no intention to redistribute the downloaded materials, had legal access to them because my university pays a subscription, and I only wanted to extract facts from these papers. Full disclosure, I downloaded approximately 30GB of data from Sciencedirect in approximately 10 days. This boils down to a server load of 0.0021GB/[min], 0.125GB/h, 3GB/day. Approximately two weeks after I started downloading psychology research papers, Elsevier notified my university that this was a violation of the access contract, that this could be considered stealing of content, and that they wanted it to stop. My librarian explicitly instructed me to stop downloading (which I did immediately), otherwise Elsevier would cut all access to Sciencedirect for my university. 
      I am now not able to mine a substantial part of the literature, and because of this Elsevier is directly hampering me in my research.
      [1] Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2015). The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 1–22. doi: 10.3758/s13428-015-0664-2
      Chris Hartgerink’s blog post
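
The annotation slides (11 and 16) describe an entity in context (prefix, surface, suffix) that is sent to hypothes.is together with a source, text, and uri. The sketch below shows roughly what such a record could look like. It is not the code of cmhypy.py, which is not shown in the slides. The function names, the sample sentence, the example.org URI, and the tag are all hypothetical; the selector follows the W3C Web Annotation `TextQuoteSelector` shape (`exact`, `prefix`, `suffix`), and the top-level field names are assumed to match what the Hypothes.is API accepts. Nothing is sent over the network.

```python
import json


def entity_in_context(text, surface, width=30):
    """Return prefix/exact/suffix for the first occurrence of `surface`.

    Returns None when the surface form does not occur in the text.
    """
    start = text.find(surface)
    if start < 0:
        return None
    end = start + len(surface)
    return {
        "prefix": text[max(0, start - width):start],
        "exact": surface,
        "suffix": text[end:end + width],
    }


def build_annotation(uri, context, label, tags=()):
    """Build a JSON-serialisable annotation body (assumed field layout)."""
    selector = {"type": "TextQuoteSelector", **context}
    return {
        "uri": uri,
        "text": label,
        "tags": list(tags),
        "target": [{"source": uri, "selector": [selector]}],
    }


# Hypothetical sentence standing in for a mined paper.
sentence = "The essential oil of Mentha spicata is rich in carvone and limonene."
context = entity_in_context(sentence, "carvone", width=20)

annotation = build_annotation(
    "https://example.org/paper.html",  # hypothetical paper URI
    context,
    label="phytochemical: carvone",
    tags=["phytochem"],  # hypothetical tag
)
print(json.dumps(annotation, indent=2))
```

Keeping the prefix and suffix alongside the exact match lets an annotation client re-anchor the entity in the rendered paper even when character offsets differ between the mined XML and the published HTML. With 100+ annotations per paper, one record of this kind would be produced per matched entity.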
