Workshop 5: Uptake of, and concepts in, text and data mining
Treating literature as data,
dredging the epic vastness
for useful information
Dr Ross Mounce
Director of Open Access
Before that, a Postdoc at
a Fellow of the
('Class of 2016')
What is Content Mining?
information extraction from content
Content can be text, numerical data,
static images such as photographs,
videos, audio, metadata or any
digital information, and/or a
combination of them all
I prefer CM to ‘TDM’ because it
highlights the diversity of content,
not just text & data
scholarly papers available online
36,000,000 of which are
‘Biology’ / ‘Environmental Studies’ / ‘Geosciences’ / ‘Multidisciplinary’
Khabsa, M. and Giles, C. L. 2014. The number of scholarly documents on the public web. PLoS ONE
Recomputing statistical tests.
Did the authors report the result correctly?
Tilburg University (NL)
Source: Hartgerink et al. (2016) Distributions of p-values smaller than .05 in psychology: what is going on? PeerJ https://doi.org/10.7717/peerj.1935
More than 30,000 papers downloaded; p-value information could be extracted from 54% of them
“67.45% of p-values reported as .05 were larger than .05 when recalculated”
688,112 Statistical Results: Content Mining Psychology Articles for Statistical Test Results
Data 2016, 1(3), 14; doi:10.3390/data1030014
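The core idea behind this work, recomputing a reported p-value from the reported test statistic and flagging mismatches, can be sketched in a few lines. This is an illustrative stdlib-only re-implementation using a z-statistic (the authors' actual tool, statcheck, is an R package that handles t, F, χ² and more):

```python
import re
from statistics import NormalDist

def recompute_p(report):
    """Parse a reported z-statistic like 'z = 2.20, p < .05' and
    recompute its two-tailed p-value from the standard normal."""
    m = re.search(r"z\s*=\s*([\d.]+)", report)
    if m is None:
        return None
    z = float(m.group(1))
    # two-tailed p: twice the upper-tail probability
    return 2 * (1 - NormalDist().cdf(z))

print(round(recompute_p("z = 2.20, p < .05"), 4))  # 0.0278
```

Comparing the recomputed value against the reported inequality is then a string comparison, which is exactly the kind of check that scales to hundreds of thousands of results.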
What has been published recently using specimens from
The Natural History Museum, London?
“Micro-computed tomography scan slice through four bat skulls, displaying the relative position of
the three semicircular canals within the skull. Scans are from the following species: (A) Pteropus
rodricensis (BMNH.126.96.36.199); …”
NHM Data Portal Link (Stable, Unique Identifier)
Research Article DOI (Stable, Unique Identifier)
It’s not KE Emu :)
Red circles indicate those not found
in the official NHM data portal(!)
Can we rebuild a better catalogue
directly from the literature(?)
Fossil mammal specimens found through content mining
Does published info make it back ‘home’ to the collections?
BMNH 2013.2.13.3 on the portal as “Petrochromis nov.sp. Takahashi”
I found it (by text mining) here: http://dx.doi.org/10.1007/s10228-014-0396-9
It’s now called Petrochromis horii n. sp., according to the paper.
What mechanisms are there to update newer information back into the collection?
Content mining could help keep collections data up-to-date!
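Pulling specimen codes out of full text is, at its simplest, pattern matching. A minimal sketch for BMNH-style codes, using the two identifiers quoted above (real collection codes are far messier than this toy regex allows for):

```python
import re

# Toy pattern for BMNH-style specimen codes, e.g. 'BMNH 2013.2.13.3'
# or 'BMNH.126.96.36.199'. Real-world codes need a broader grammar.
BMNH_RE = re.compile(r"BMNH[\s.]\d+(?:\.\d+)+")

text = ("Scans are from Pteropus rodricensis (BMNH.126.96.36.199); "
        "the specimen BMNH 2013.2.13.3 is on the NHM Data Portal.")

print(BMNH_RE.findall(text))
# ['BMNH.126.96.36.199', 'BMNH 2013.2.13.3']
```

Run over a whole corpus, matches like these can be cross-checked against the official portal to find specimens the catalogue has missed.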
Why write such descriptive papers in natural
language? Keep data as data!
The above was published in 2013(!)
Searching ALL full texts is not enough:
A significant number of specimens are
probably ‘hiding-out’ in supplementary
data files of all sorts of formats.
Google Scholar does not index SI
Web of Science doesn’t either
Nor does Scopus
At scale, journal-held supplementary
data files are the ‘darkest corners’ of the research literature
“Specimens were deposited in the collections of the California Academy of Sciences' Department of
Herpetology (CAS), the British Museum of Natural History (BMNH) and of author GJM (Table S1)”
Be careful where you publish
Web of Knowledge searches for “phylog*”, “phylogen*”, “phylogeny” or “cladis*”
will not find all published phylogenetic analyses
Chapter 6: Optimal search strategies for finding phylogenetic data
Mounce, R., 2013. Comparative Cladistics: Fossils, Morphological Data Partitions and Lost Branches in
the Fossil Tree of Life. Thesis (Doctor of Philosophy (PhD)). Biology & Biochemistry.
Mounce, R. 2015. Dark research: information content in many modern research papers is not easily
discoverable online. PeerJ PrePrints http://dx.doi.org/10.7287/peerj.preprints.773v1
Weevils and their host plants (Entity co-occurrence)
University of California Riverside
Sadly, the vast majority of papers are only ‘available’ online to paying subscribers and no
institution in the world has access to everything. Not even close to everything!
In 2016, libraries pay subscriptions, or individuals pay per-article fees,
to access even out-of-copyright works
Some academic societies recognise the value of releasing
This is what a PDF looks like
PDF is NOT a mining-friendly format
HTML is better, but imperfect:
+ italics & bold preserved, semantic links to figures & tables
− lacks standardisation across publishers
The industry standard format for
scholarly articles is JATS XML
Journal Article Tag Suite
is an application of NISO Z39.96-2015, which defines a set of XML elements and
attributes for tagging journal articles
Standardising the format of digital scholarly publications is HIGHLY desirable
e.g. for this project, knowing whether the string 'NHM' occurs in the Materials section, rather
than in the Acknowledgements section, is hugely helpful.
Much harder to do with PDF/HTML.
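With JATS XML, section-aware matching like this takes only a few lines. A sketch using just the standard library, run over a hand-made JATS-like fragment (the snippet and its `sec-type` value are illustrative, not taken from a real article):

```python
import xml.etree.ElementTree as ET

# Minimal JATS-like fragment, invented for illustration
jats = """<article>
  <body>
    <sec sec-type="materials">
      <p>Specimens were loaned from the NHM, London.</p>
    </sec>
  </body>
  <back>
    <ack><p>We thank colleagues at the NHM for discussion.</p></ack>
  </back>
</article>"""

root = ET.fromstring(jats)

def section_mentions(root, xpath, term):
    """True if `term` appears in the text of any element under `xpath`."""
    return any(term in (el.text or "")
               for sec in root.findall(xpath)
               for el in sec.iter())

# The same string can be located per section, impossible with flat PDF text
print(section_mentions(root, ".//sec[@sec-type='materials']", "NHM"))  # True
print(section_mentions(root, ".//ack", "NHM"))                         # True
```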
Section-based search already implemented in EuropePMC!
→ Section level search functionality in Europe PMC. Kafkas et al (2015) J Biomed Semantics
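Europe PMC exposes this through fielded queries in its REST search API. The sketch below only builds the query URL, without sending it; the `METHODS` field name follows Europe PMC's query syntax as described in the paper above, but check the current API documentation before relying on it:

```python
from urllib.parse import urlencode

# Europe PMC REST search endpoint (no request is made here)
BASE = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

# Restrict the search to the Methods sections of articles
params = urlencode({"query": 'METHODS:"Natural History Museum"',
                    "format": "json"})
url = f"{BASE}?{params}"
print(url)
```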
A plea for full text XML
A minority of journals do not provide full text XML
✓PLOS, eLife, PeerJ, Pensoft, Wiley, Elsevier, Springer,
NPG, Ubiquity Press, Copernicus, Hindawi, MDPI
✘ Geological Society of London Publications,
Magnolia Press, a long tail of smaller publishers
Most scholarly communications output would get
ZERO (not available online), or just ONE star
Live demo / video time!
Large-scale content acquisition with
Long version: https://youtu.be/kiv_gxGp2IQ
Short version: https://youtu.be/H2ESPjihnDA
Image credit: Ubiquity Press
UK Copyright Law changed in 2014,
giving an exemption for
non-commercial text and
data mining work
A complicated, fragmented landscape of relevant journals
Nature + Science + PNAS + Phytotaxa + Zootaxa
BioOne Journals (131)
Springer Journals (32)
Wiley Journals (22)
Taylor & Francis Journals (14)
Elsevier Journals (12)
Oxford University Press Journals (8)
SciELO Journals (7) [Open Access but not in PMC]
Ecological Society of America Journals (6)
Geological Society Journals (4)
CSIRO Journals (4)
Cambridge University Press Journals (3)
Royal Society Journals (2)
I discover 'new' journals every week
e.g. last week I ‘found’ Oryctos (published 1998–2010),
still behind a paywall. Does anyone have access to this journal?
Please let me know
How are we meant to achieve a comprehensive
aggregation of research literature (to do rigorous
science, inclusive of all the evidence) when it is
so unhelpfully scattered and we don't even know
where it all is?
* new * Crossref Metadata Participation Reports
A publisher-level overview of Crossref
metadata completeness (of Metadata
submitted to Crossref)
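The same information is queryable programmatically via Crossref's REST API. This sketch just constructs a member-search URL (no network call); the endpoint is real, but treat the exact response fields as something to verify against the current Crossref API docs:

```python
from urllib.parse import quote

# Crossref REST API: search publisher (member) records by name.
# Member records include per-journal metadata coverage statistics.
def member_search_url(publisher_name):
    return "https://api.crossref.org/members?query=" + quote(publisher_name)

print(member_search_url("Public Library of Science"))
```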
Sadly, quite a few publishers,
UTPress included, do not
provide much metadata to Crossref.
I would like to see this improve
for papers published everywhere.
This is for downloading/updating/maintaining a repository of all PLOS XML article files.
This can be used to have a copy of the PLOS text corpus for further analysis.
Use this program to download all PLOS XML article files instead of doing web scraping.
allofplos – a smart tool for maintaining a corpus
Please don’t run this on the conference WiFi, all of PLOS is > 4.4GB (!!!)
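Once allofplos has populated a local directory of article XML, whole-corpus analysis is a short script. A sketch over a throwaway mock corpus; the filenames and markup below are invented stand-ins for real PLOS JATS files:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Mock corpus: two tiny XML files standing in for the directory
# allofplos would maintain (real PLOS XML is far richer)
corpus = tempfile.mkdtemp()
articles = {
    "journal.pone.0000001.xml":
        "<article><body><p>We used text mining here.</p></body></article>",
    "journal.pone.0000002.xml":
        "<article><body><p>No mining of any kind.</p></body></article>",
}
for name, xml in articles.items():
    with open(os.path.join(corpus, name), "w") as fh:
        fh.write(xml)

def count_mentions(corpus_dir, term):
    """Count articles whose full text contains `term`."""
    hits = 0
    for fname in os.listdir(corpus_dir):
        root = ET.parse(os.path.join(corpus_dir, fname)).getroot()
        hits += term in "".join(root.itertext())
    return hits

print(count_mentions(corpus, "text mining"))  # 1
```

The same loop scales to the full >4.4 GB corpus, which is exactly why having a local, complete, XML copy beats scraping.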
Sincere thanks to:
Aime Rankin for help with the project
The NHM Library staff, particularly Sarah Vincent for actively supporting my content mining
Nancy Chillingsworth (IPR, NHM London)
Mark Wilkinson (Life Sciences, NHM London)
Peter Murray-Rust & the ContentMine team
Vince Smith (Life Sciences, NHM London)
Ben Scott (NHM Data Portal Lead Architect)
Rod Page (University of Glasgow)
All of the Biodiversity Informatics team
For a more detailed version of this talk on
YouTube see: bit.ly/nhmlink
So… Why aren't more researchers using text and data mining?