High throughput mining of the plant-science literature

We can now mine the plant science literature for facts, especially species (both plants and others), chemicals, diseases and other agricultural terms. This presentation gives a number of examples and links showing how you can do this on the Open Access literature.

Published in: Science
  1. 1. Mining science from the plant literature. ContentMine, Rothamsted Research, Harpenden, UK, 2016-09-12. Peter Murray-Rust, [1] University of Cambridge, [2] The ContentMine. 5,000 scholarly publications every day. How many relate to plants?
  2. 2. Overview • Scholarly literature • Automation of downloading, normalization • Discipline-dependent semantics/ontology • Classification • Extraction • Annotation • Mining diagrams • Politics of mining
  3. 3. The Right to Read is the Right to Mine** (**Peter Murray-Rust, 2011) http://contentmine.org
  4. 4. (2x digital music industry!)
  5. 5. Output of scholarly publishing: 586,364 Crossref DOIs/month (201507) [1]; 8000 papers/day; 2.5-3 million (papers + supplemental data)/year; each 3 mm thick → 4500 m high per year [2]. * Most is not publicly readable. [1] http://www.crossref.org/01company/crossref_indicators.html [2] https://en.wikipedia.org/wiki/Mont_Blanc#/media/File:Mont_Blanc_depuis_Valmorel.jpg
  6. 6. What is “Content”? http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY SECTIONS MAPS TABLES CHEMISTRY TEXT MATH contentmine.org tackles these
  7. 7. MozFest 2015 ContentMine + TGAC / hack
  8. 8. Terpinome Phytochemists! Salvia officinalis Salvia microphylla Origanum vulgare Ocimum basilicum Laurus nobilis [1] [1] Lauraceae
  9. 9. We can search for • Plants • Compounds • Other species • Diseases • Frequent terms • We’ll need: sources, dictionaries, software
  10. 10. Europe PubMedCentral Over 1 million biomedical papers
  11. 11. Dictionaries! Diseases (WHO)
  12. 12. catalogue getpapers query Daily Crawl EuPMC, arXiv, CORE, HAL, (UNIV repos) PDF HTML DOC ePUB TeX XML PNG EPS CSV XLS URLs DOIs quickscrape norma Normalizer Structurer Semantic Tagger Text Data Figures ami search Lookup CONTENT MINING Chem Phylo Trials Crystal Plants COMMUNITY plugins Visualization and Analysis PloSONE, BMC, peerJ… Nature, IEEE, Elsevier… Publisher Sites scrapers queries taggers abstract methods references Captioned Figures Fig. 1 HTML tables 100,000 pages/day Semantic ScholarlyHTML (W3C community group) Facts Latest 20150908 CONTENTMINE SOFTWARE Crossref
  13. 13. What plants produce Carvone? https://en.wikipedia.org/wiki/Carvone
  14. 14. Mining for phytochemicals • getpapers -q carvone -o carvone -x -k 100: search “carvone”, output to carvone/, format XML, limit 100 hits • cmine carvone: normalize papers; search locally for species, sequences, diseases, drugs. Results in dataTables.html and results/…/results.xml (includes W3C annotation) • python cmhypy.py carvone/ -u petermr <key>: send annotations -> hypothes.is
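The local search step on this slide can be pictured as a dictionary lookup over normalized text. The following is a minimal sketch of that idea only, not the ContentMine `ami` implementation; the dictionary contents, the `count_terms` function and the sample sentence are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical mini-dictionaries; real dictionaries would be far larger.
DICTIONARIES = {
    "phytochem": ["carvone", "limonene", "menthol"],
    "species": ["Mentha spicata", "Carum carvi"],
}

def count_terms(text, dictionaries=DICTIONARIES):
    """Count case-insensitive whole-word matches of each dictionary term."""
    results = {}
    for facet, terms in dictionaries.items():
        counts = Counter()
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            hits = re.findall(pattern, text, flags=re.IGNORECASE)
            if hits:
                counts[term] = len(hits)
        results[facet] = counts
    return results

# Made-up sample sentence standing in for a normalized paper.
sample = ("Carvone is abundant in the oil of Carum carvi (caraway) and "
          "Mentha spicata; carvone and limonene often co-occur.")
facets = count_terms(sample)
print(facets)
```

Each facet (phytochem, species) gets its own counter, which mirrors the per-facet columns shown on the later "ARTICLES / FACETS" slide.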
  15. 15. Search for carvone
  16. 16. https://en.wikipedia.org/wiki/Carvone WIKIDATA
  17. 17. Carvone in Wikidata Also SPARQL endpoint WP identifier Chemical type Chemical identifier
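As a rough illustration of the SPARQL endpoint mentioned on this slide, the sketch below builds (but does not send) a request URL for the public Wikidata query service. The query shape and the helper names `build_query` and `build_url` are assumptions made for illustration; `P235` (InChIKey) and `P231` (CAS Registry Number) are Wikidata chemical identifier properties.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Public Wikidata SPARQL endpoint.
ENDPOINT = "https://query.wikidata.org/sparql"

def build_query(label):
    """Build a SPARQL query for items with this English label and an InChIKey."""
    return (
        "SELECT ?item ?inchikey ?cas WHERE {\n"
        f'  ?item rdfs:label "{label}"@en ;\n'
        "        wdt:P235 ?inchikey .\n"          # P235 = InChIKey
        "  OPTIONAL { ?item wdt:P231 ?cas . }\n"  # P231 = CAS Registry Number
        "}\nLIMIT 10"
    )

def build_url(label):
    """Return a GET URL asking the endpoint for JSON results."""
    params = {"query": build_query(label), "format": "json"}
    return ENDPOINT + "?" + urlencode(params)

url = build_url("carvone")
print(url)
```

The URL can be pasted into a browser or fetched with any HTTP client; no request is made by the code itself.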
  18. 18. ARTICLES FACETS gene disease drug Phyto chem species genus words
  19. 19. Suggest the title of this article
  20. 20. species words drug Phytochem disease
  21. 21. species words drug Phytochem disease disease
  22. 22. Annotation (entity in context) prefix surface label location suffix
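The five parts named on this slide (prefix, surface, label, location, suffix) can be sketched as a small record built around each match. This is an illustrative structure under assumed field semantics, not the ContentMine or W3C annotation format; the `annotate` function and the sample sentence are hypothetical.

```python
import re

def annotate(text, term, label, context=20):
    """Return one entity-in-context record per exact match of `term`."""
    annotations = []
    for match in re.finditer(re.escape(term), text):
        start, end = match.span()
        annotations.append({
            "prefix": text[max(0, start - context):start],  # text before the match
            "surface": text[start:end],                     # the matched string itself
            "label": label,                                 # dictionary or facet name
            "location": (start, end),                       # character offsets
            "suffix": text[end:end + context],              # text after the match
        })
    return annotations

# Made-up sample sentence.
text = "Essential oil of Ocimum basilicum was analysed."
anns = annotate(text, "Ocimum basilicum", "species")
print(anns[0])
```

Keeping the prefix and suffix alongside the offsets lets an annotation be re-anchored even if the character positions shift between versions of a document.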
  23. 23. Annotation with Hypothes.is Original publication “on publisher’s site” Annotation “on Hypothes.is site”
  24. 24. Amanuens.is Hypothes.is link Hypothes.is markup of article
  25. 25. http://chemicaltagger.ch.cam.ac.uk/ • Typical chemical synthesis
  26. 26. Automatic semantic markup of chemistry Could be used for analytical, crystallization, etc.
  27. 27. Automatic extraction of plant species from the literature Lars Willighagen, ContentMine Fellow 2016, NL https://larsgw.github.io/contentmine-fellowship/html/card_c03-d.html
  28. 28. Mining diagrams
  29. 29. Bitmap Image and Tesseract OCR [plot: Ln bacterial load per fly against days post-infection]
  30. 30. “Root”
  31. 31. OCR (Tesseract) Norma (imageanalysis) (((((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301):131,Thermotoga_maritime:357):12,(Mycobacterium_tuberculosis:223,Bifidobacterium_longum:333):158):10,((Optiutus_terrae:441,(((Borrelia_burgdorferi:…202):91):22):32,(Proprinogenum_modestus:124,Fusobacterium_nucleatum:167):217):11):9); Semantic re-usable/computable output (ca 4 secs/image)
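The bracketed string on this slide is in Newick tree notation, which is what makes the output computable. As a minimal sketch (a simplified parser written for illustration, not the Norma code, and without support for quoted names or comments), the first complete subtree from the slide can be read into nested tuples:

```python
def parse_newick(s):
    """Parse a simple Newick string into nested (name, length, children) tuples."""
    s = s.strip().rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if pos < len(s) and s[pos] == "(":
            pos += 1  # consume "("
            children.append(node())
            while pos < len(s) and s[pos] == ",":
                pos += 1  # consume ","
                children.append(node())
            if pos >= len(s) or s[pos] != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1  # consume ")"
        start = pos
        while pos < len(s) and s[pos] not in ",():":
            pos += 1
        name = s[start:pos]
        length = None
        if pos < len(s) and s[pos] == ":":
            pos += 1  # consume ":"
            start = pos
            while pos < len(s) and s[pos] not in ",()":
                pos += 1
            length = float(s[start:pos])
        return (name, length, children)

    return node()

def leaves(tree):
    """Flatten a parsed tree into a list of (leaf name, branch length)."""
    name, length, children = tree
    if not children:
        return [(name, length)]
    found = []
    for child in children:
        found.extend(leaves(child))
    return found

# First complete subtree taken from the slide's output.
subtree = "(Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86;"
tree = parse_newick(subtree)
print(leaves(tree))
```

The full string on the slide contains an elision, so only a complete fragment is parsed here.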
  32. 32. Supertree created from 4300 papers
  33. 33. C) What’s the problem with this spectrum? Org. Lett., 2011, 13 (15), pp 4084–4087 Original thanks to ChemBark
  34. 34. After AMI2 processing….. … AMI2 has detected a square
  35. 35. https://contentmine-demo.herokuapp.com/ ContentMine data visualizations, Chris Kittel
  36. 36. https://contentmine-demo.herokuapp.com/trending 1 month, commonest disease terms
  37. 37. Terms from dictionaries
  38. 38. Co-occurrence of gene names in same sentence
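Sentence-level co-occurrence can be sketched as: split text into sentences, find which dictionary terms appear in each, and count every pair. The example below is a hedged illustration with a hypothetical gene dictionary, a naive sentence splitter and made-up sample text; it is not the method used to produce the slide's figure.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical gene-name dictionary.
GENES = {"FLC", "FT", "SOC1", "CO"}

def cooccurrence(text, genes=GENES):
    """Count pairs of dictionary terms that appear in the same sentence."""
    counts = Counter()
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    for sentence in sentences:
        tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
        present = sorted(tokens & genes)  # sorted so each pair has one canonical order
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts

# Made-up sample text.
text = ("FLC represses FT and SOC1. FT activates SOC1 in the meristem. "
        "CO was not measured.")
pairs = cooccurrence(text)
print(pairs.most_common())
```

The resulting pair counts are the edge weights one would feed into a network visualization.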
  39. 39. https://zenodo.org/record/61334#.V9XKT4XerCk
  40. 40. Systematic Reviews Can we: • eliminate true negatives automatically? • extract data from formulaic language? • mine diagrams? • Annotate existing sources? • forward-reference clinical trials?
  41. 41. Polly has 20 seconds to read this paper… …and 10,000 more
  42. 42. ContentMine software can do this in a few minutes Polly: “there were 10,000 abstracts and due to time pressures, we split this between 6 researchers. It took about 2-3 days of work (working only on this) to get through ~1,600 papers each. So, at a minimum this equates to 12 days of full-time work (and would normally be done over several weeks under normal time pressures).”
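The figures in Polly's account can be checked with a few lines of arithmetic. This is only a worked restatement of the numbers quoted on these slides (10,000 abstracts, 6 researchers, 2-3 days each, 20 seconds per paper); the variable names are illustrative.

```python
abstracts = 10_000
researchers = 6
seconds_per_paper = 20  # from the previous slide

# Share per researcher if the abstracts are split evenly.
per_researcher = abstracts / researchers

# Each researcher spent 2-3 full days, so total effort in person-days:
person_days_low = researchers * 2
person_days_high = researchers * 3

# Time for one person to give every abstract 20 seconds, in hours.
reading_hours = abstracts * seconds_per_paper / 3600

print(round(per_researcher), person_days_low, person_days_high, round(reading_hours, 1))
```

An even split gives roughly 1,667 abstracts each, in line with the "~1,600" quoted, and the 12-day figure is the lower bound of a 12 to 18 person-day range.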
  43. 43. 400,000 Clinical Trials In 10 government registries Mapping trials => papers http://www.trialsjournal.com/content/16/1/80 2009 => 2015. What’s happened in last 6 years?? Search the whole scientific literature For “2009-0100068-41”
  44. 44. (2x digital music industry!) Contentmine.org Non-profit Collaborations include: • University of Cambridge Plant Sciences • TGAC/Open Plant • EuropePMC • Wikimedia • Some publishers
