Workshop 5: Uptake of, and concepts in text and data mining


Workshop given at ELPUB2018 at the University of Toronto.

Published in: Education

  1. 1. Content Mining Treating literature as data, dredging the epic vastness for useful information Dr Ross Mounce @rmounce
  2. 2. About Me Director of Open Access Programmes Before that, a Postdoc at a Fellow of the ('Class of 2016')
  3. 3. What is Content Mining? • Large-scale computer-aided information extraction from content • Content can be text, numerical data, static images such as photographs, videos, audio, metadata or any digital information, and/or a combination of them all • I prefer CM to ‘TDM’ because it highlights the diversity of content, not just text & data
  4. 4. 114,000,000 scholarly papers available online 36,000,000 of which are ‘Biology’ / ‘Environmental Studies’ / ‘Geosciences’ / ‘Multidisciplinary’ Khabsa, M. and Giles, C. L. 2014. The number of scholarly documents on the public web. PLoS ONE
  5. 5. Think meta
  6. 6. Example #1 • Recomputing statistical tests. • Did the authors report the result accurately/correctly? Chris Hartgerink, Tilburg University (NL)
  7. 7. Source: Hartgerink et al. (2016) Distributions of p-values smaller than .05 in psychology: what is going on? PeerJ >30,000 papers downloaded; p-value information could be extracted from 54% of them N=247 0 “67.45% of p-values reported as .05 were larger than .05 when recalculated”
  8. 8. 688,112 Statistical Results: Content Mining Psychology Articles for Statistical Test Results Data 2016, 1(3), 14; doi:10.3390/data1030014
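Recomputing a reported test result, as in Example #1, can be sketched in a few lines. This is a minimal illustration only, not the method used in the cited papers: it handles z-tests alone, and the reporting pattern, the `tolerance` value and the sample sentence are all assumptions made up for the example.

```python
import math
import re

# Hypothetical pattern for results reported like "z = 1.96, p = .05".
REPORTED = re.compile(r"z\s*=\s*(-?\d*\.?\d+),\s*p\s*([<=>])\s*(\d*\.\d+)")


def two_sided_p_from_z(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def check_reported(text: str, tolerance: float = 0.005):
    """Yield (z, reported_p, recomputed_p, consistent) for each result found."""
    for match in REPORTED.finditer(text):
        z = float(match.group(1))
        comparator = match.group(2)
        reported = float(match.group(3))
        recomputed = two_sided_p_from_z(z)
        if comparator == "=":
            # Allow for rounding in the reported value (assumed tolerance).
            consistent = abs(recomputed - reported) <= tolerance
        elif comparator == "<":
            consistent = recomputed < reported
        else:
            consistent = recomputed > reported
        yield z, reported, recomputed, consistent


# Invented sample text, for illustration only.
sample = ("The first effect was significant, z = 1.96, p = .05, "
          "and so was the second, z = 1.80, p = .05.")
results = list(check_reported(sample))
for z, reported, recomputed, consistent in results:
    print(f"z={z:.2f} reported p={reported} recomputed p={recomputed:.4f} ok={consistent}")
```

A real pipeline would also need to cover t, F, r and chi-square tests and the many ways authors format them; the point here is only that a reported p-value can be checked against its own test statistic.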
  9. 9. Example #2 • What has been published recently using specimens from The Natural History Museum, London? “Micro-computed tomography scan slice through four bat skulls, displaying the relative position of the three semicircular canals within the skull. Scans are from the following species: (A) Pteropus rodricensis (BMNH.; …” NHM Data Portal Link (Stable, Unique Identifier) Research Article DOI (Stable, Unique Identifier)
  10. 10. NHM-specimens
  11. 11. Source: © The Trustees of the Natural History Museum, London
  12. 12. • New • Open Data • Easy-to-use • Quick • Images • Audio • Interactive Maps • Citable • API access • Open Source Infrastructure It’s not KE Emu :)
  13. 13. Literature-to-Specimen Links Red circles indicate those not found in the official NHM data portal(!) Can we rebuild a better catalogue directly from the literature(?) Fossil mammal specimens found through content mining
  14. 14. Does published info make it back ‘home’ to the collections? BMNH 2013.2.13.3 on the portal as “Petrochromis nov.sp. Takahashi” I found it (by text mining) here: It’s now called: Petrochromis horii n. sp., according to the paper. What mechanisms are there to update newer information back into the collection? Content mining could help keep collections data up-to-date!
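Finding specimen codes such as "BMNH 2013.2.13.3" in full text is essentially a pattern-matching task. The sketch below is a minimal example under a strong assumption: that an accession number is "BMNH" followed by dot-separated digit groups. Real catalogue numbers come in many more variants, and the sample sentence (including its second specimen number) is invented.

```python
import re

# Assumed shape of an accession number: "BMNH", an optional space or dot,
# then two or more dot-separated digit groups, e.g. "BMNH 2013.2.13.3".
SPECIMEN = re.compile(r"\bBMNH[ .]?(\d+(?:\.\d+)+)")


def find_specimens(text: str) -> list:
    """Return normalised 'BMNH <number>' codes, de-duplicated, in order of appearance."""
    seen = []
    for match in SPECIMEN.finditer(text):
        code = f"BMNH {match.group(1)}"
        if code not in seen:
            seen.append(code)
    return seen


# Invented sample text, for illustration only.
sample = ("Holotype BMNH 2013.2.13.3; a second (invented) specimen "
          "BMNH.2013.2.13.4 is listed with it, as is BMNH 2013.2.13.3 again.")
print(find_specimens(sample))
```

Codes recovered this way could then be compared against the data portal to spot specimens that are cited in the literature but missing or out of date in the catalogue.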
  15. 15. Why write such descriptive papers in natural language? Keep data as data! The above was published in 2013(!)
  16. 16. Searching ALL full texts is not enough!!! A significant number of specimens are probably ‘hiding-out’ in supplementary data files of all sorts of formats. Google Scholar does not index SI. Web of Science doesn’t either. Nor does Scopus. At scale, journal-held supplementary data files are the ‘darkest corners’ of science. “Specimens were deposited in the collections of the California Academy of Sciences' Department of Herpetology (CAS), the British Museum of Natural History (BMNH) and of author GJM (Table S1)” 10.1371/journal.pone.0104628
  17. 17. Be careful where you publish Web of Knowledge searches for “phylog*”, “phylogen*” or “phylogeny” or “cladis*” will not find you all published phylogenetic analyses Chapter 6: Optimal search strategies for finding phylogenetic data Mounce, R., 2013. Comparative Cladistics: Fossils, Morphological Data Partitions and Lost Branches in the Fossil Tree of Life. Thesis (Doctor of Philosophy (PhD)). Biology & Biochemistry. Mounce, R. 2015. Dark research: information content in many modern research papers is not easily discoverable online. PeerJ PrePrints
  18. 18. Example #3 Weevils and their host plants (Entity co-occurrence) oducing-fellow-guanyang-zhang-mining-weevil-plant -associations/ Guanyang Zhang, University of California Riverside
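Entity co-occurrence, as named in Example #3, can be illustrated with a very small sketch: count how often a weevil name and a plant name appear in the same sentence. The two name sets, the naive sentence splitter and the sample text are assumptions for illustration, not the approach used in the project mentioned on the slide.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical mini-dictionaries; a real project would use curated taxon name lists.
WEEVILS = {"Curculio", "Sitophilus"}
PLANTS = {"Quercus", "Oryza"}


def cooccurrences(text: str) -> Counter:
    """Count (weevil, plant) pairs mentioned within the same sentence."""
    counts = Counter()
    # Naive sentence split: ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = set(re.findall(r"[A-Z][a-z]+", sentence))
        for pair in product(sorted(words & WEEVILS), sorted(words & PLANTS)):
            counts[pair] += 1
    return counts


# Invented sample text, for illustration only.
sample = ("Curculio larvae were reared from Quercus acorns. "
          "Sitophilus adults were collected on stored Oryza grain. "
          "Curculio was again found on Quercus.")
pairs = cooccurrences(sample)
print(pairs.most_common())
```

Co-occurrence in a sentence is only a hint of an association, so counts like these would normally be ranked and then checked by a person.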
  19. 19. Sadly, the vast majority of papers are only ‘available’ online to paying subscribers and no institution in the world has access to everything. Not even close to everything!
  20. 20. In 2016, libraries pay subscriptions, or individuals pay per-article fees, to access even out-of-copyright works??
  21. 21. Some academic societies recognise the value of releasing out-of-copyright content
  22. 22. This is what a PDF looks like PDF is NOT a good method of exchanging information
  23. 23. HTML is better, but lacks standardisation + italics & bold preserved, semantic links to figures & tables - lacks standardisation
  24. 24. The industry standard format for scholarly articles is JATS XML • The Journal Article Tag Suite (JATS) is an application of NISO Z39.96-2015, which defines a set of XML elements and attributes for tagging journal articles • Standardising the format of digital scholarly publications is HIGHLY desirable • e.g. for this project, knowing if the string 'NHM' occurs in the Materials section, rather than the Acknowledgements section, is hugely helpful. • Much harder to do with PDF/HTML. • Section-based search already implemented in EuropePMC! → Section level search functionality in Europe PMC. Kafkas et al (2015) J Biomed Semantics
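The 'NHM' example on slide 24 shows why structured XML helps: once sections are tagged, asking where a string occurs is a short query. The sketch below runs on a tiny hand-written JATS-like fragment (not a real article) and uses only the Python standard library; the function name `sections_mentioning` is made up for this example.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written JATS-like fragment for illustration, not a real article.
JATS = """<article>
  <body>
    <sec sec-type="intro"><title>Introduction</title><p>Bats are diverse.</p></sec>
    <sec sec-type="materials|methods"><title>Materials and Methods</title>
      <p>Skulls were borrowed from the NHM, London.</p></sec>
  </body>
  <back>
    <ack><p>We thank the staff of the NHM library.</p></ack>
  </back>
</article>"""


def sections_mentioning(xml_text: str, needle: str) -> list:
    """Return labels of <sec> and <ack> elements whose text contains needle."""
    root = ET.fromstring(xml_text)
    hits = []
    for sec in root.iter("sec"):
        if needle in "".join(sec.itertext()):
            # Prefer the sec-type attribute; fall back to the section title.
            hits.append(sec.get("sec-type") or sec.findtext("title", default="untitled"))
    for ack in root.iter("ack"):
        if needle in "".join(ack.itertext()):
            hits.append("ack")
    return hits


print(sections_mentioning(JATS, "NHM"))
```

Here a mention in the methods section can be told apart from one in the acknowledgements, which is the distinction the slide is after. Real articles nest sections and do not always set `sec-type`, so a production version would need to handle both.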
  25. 25. A plea for full text XML A minority of journals do not provide full text XML ✓ PLOS, eLife, PeerJ, Pensoft, Wiley, Elsevier, Springer, NPG, Ubiquity Press, Copernicus, Hindawi, MDPI ✘ Geological Society of London Publications, Magnolia Press, a long tail of smaller publishers
  26. 26. Most scholarly communications output would get ZERO (not available online), or just ONE star
  27. 27. getpapers quickscrape norma User-supplied local files Academic Journal websites
  28. 28. Live demo / video time! Large-scale content acquisition with getpapers & quickscrape Long version: Short version:
  29. 29. Image credit: Ubiquity Press UK Copyright Law has changed recently (2014), giving a specific copyright exemption for non-commercial text and data mining work
  30. 30. A complicated, fragmented landscape of relevant journals Nature + Science + PNAS + Phytotaxa + Zootaxa BioOne Journals (131) Springer Journals (32) Wiley Journals (22) Taylor & Francis Journals (14) Elsevier Journals (12) Oxford University Press Journals (8) SciELO Journals (7) [Open Access but not in PMC] Ecological Society of America Journals (6) Geological Society Journals (4) CSIRO Journals (4) Cambridge University Press Journals (3) Royal Society Journals (2) Journal-omics!
  31. 31. I discover 'new' journals every week e.g. last week I 'found' Oryctos (published between 1998-2010), still behind a paywall. Does anyone have access to this journal? Please let me know How are we meant to achieve a comprehensive aggregation of research literature (to do rigorous science, inclusive of all the evidence) when it is so unhelpfully scattered and we don't even know where it all is?
  32. 32. * new * Crossref Metadata Participation Reports A publisher-level overview of Crossref metadata completeness (of Metadata submitted to Crossref)
  33. 33. Sadly, quite a few university presses, UTPress included, do not provide much metadata to Crossref. I would like to see this change...(!)
  34. 34. SpringerAPI + FORMIS: Manually-curated ant paper abstracts (only) from papers published up to 1996
  35. 35. /^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i Regular expressions
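The pattern on slide 35 has the shape of a DOI matcher: the "10." prefix, a 4 to 9 digit registrant code, a slash, then a suffix. A minimal Python translation is sketched below. Two things here are assumptions rather than part of the slide: the dot after "10" is escaped, which is slightly stricter, and the helper name `looks_like_doi` is invented.

```python
import re

# Python translation of the slide's pattern; the "i" flag becomes re.IGNORECASE.
# Assumption: the dot after "10" is escaped here, which is stricter than the slide.
DOI = re.compile(r"^10\.\d{4,9}/[-._;()/:A-Z0-9]+$", re.IGNORECASE)


def looks_like_doi(candidate: str) -> bool:
    """True if the whole string has the shape of a DOI."""
    return DOI.match(candidate) is not None


# The first two DOIs appear elsewhere in this talk; the others are non-matches.
for text in ["10.3390/data1030014",
             "10.1371/journal.pone.0104628",
             "doi:10.3390/data1030014",
             "10.12/too-short-prefix"]:
    print(text, looks_like_doi(text))
```

Because of the `^` and `$` anchors, this only validates a string that is already isolated; to pull DOIs out of running text, the anchors would need to be dropped or replaced with word boundaries.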
  36. 36. This is for downloading/updating/maintaining a repository of all PLOS XML article files. This can be used to have a copy of the PLOS text corpus for further analysis. Use this program to download all PLOS XML article files instead of doing web scraping. allofplos – a smart tool for maintaining a corpus Please don’t run this on the conference WiFi, all of PLOS is > 4.4GB (!!!)
  37. 37.
  38. 38. Acknowledgements Sincere thanks to: Aime Rankin for help with the project The NHM Library staff, particularly Sarah Vincent for actively supporting my content mining Nancy Chillingsworth (IPR, NHM London) Mark Wilkinson (Life Sciences, NHM London) Peter Murray-Rust & the ContentMine team Vince Smith (Life Sciences, NHM London) Ben Scott (NHM Data Portal Lead Architect) Rod Page (University of Glasgow) All of the Biodiversity Informatics team For a more detailed version of this talk on YouTube see:
  39. 39. So… Why aren't more researchers using text and data mining?