Plosslides

The Avoidable Waste of Scholarly Publishing
Peter Murray-Rust*,
ContentMine.org and the University of Cambridge
PLoS, Cambridge, UK 2015-07-09
Scholarly Publishing un/wittingly destroys huge amounts of publicly
funded research.
There are solutions; what is needed is will

Background
• ContentMine aims to make large areas of scientific fact OPEN (100
million facts/year)
• We’re working with the Wellcome Trust, Europe PubMed Central, etc.
• A politically “hot” area (Hargreaves legislation, EU activity)
• 2015 Wellcome Trust workshop on TDM and Neuroscience; “rough
consensus” on what was needed.
• Day workshop at Cochrane, UK (Amy Price, Anna Noel Storr, Ben
Goldacre)
• 2-day workshop at Edinburgh on Systematic Reviews of Animal Test
publications
• In the last few months we’ve prototyped a unique Open starting
point, continuously released.
• Can PLoS and ContentMine find constructive ways forward?

PM-R’s “first real paper”, doing science by
re-using the results of others in a novel way

1974:
Each point represented 1-4 hours
in library – discovery, volume delivery,
Transcription, hand calculation.

http://www.nytimes.com/2015/04/08/opinion/yes-we-were-warned-about-ebola.html
We were stunned recently when we stumbled across an article by European
researchers in Annals of Virology [1982]: “The results seem to indicate that
Liberia has to be included in the Ebola virus endemic zone.” In the future,
the authors asserted, “medical personnel in Liberian health centers should be
aware of the possibility that they may come across active cases and thus be
prepared to avoid nosocomial epidemics,” referring to hospital-acquired
infection.
Adage in public health: “The road to inaction is paved with research
papers.”
Bernice Dahn (chief medical officer of Liberia’s Ministry of Health)
Vera Mussah (director of county health services)
Cameron Nutt (Ebola response adviser to Partners in Health)
A System Failure of Scholarly Publishing

MONROVIA, Liberia — The conventional
wisdom among public health authorities is
that the Ebola virus, which killed at least
10,000 people in Liberia, Sierra Leone and
Guinea, was a new phenomenon, not seen in
West Africa before 2013. (The one exception
was an anomalous case in Ivory Coast in 1994,
when a Swiss primatologist was infected after
performing an autopsy on a chimpanzee.)
The conventional wisdom is wrong. We were
stunned recently when we stumbled across an
article by European researchers in Annals of
Virology: “The results seem to indicate that
Liberia has to be included in the Ebola virus
endemic zone.” In the future, the authors
asserted, “medical personnel in Liberian health
centers should be aware of the possibility that
they may come across active cases and thus be
prepared to avoid nosocomial epidemics,”
referring to hospital-acquired infection.
As members of a team drafting Liberia’s Ebola
recovery plan last month, we systematically
reviewed the literature on Ebola surveillance
since the virus’s discovery in central Africa in
1976. We learned that the virologists who wrote
that report, who were from Germany, had
analyzed frozen blood samples taken in 1978 and
1979 from 433 Liberian citizens. They found that
26 (or 6 percent) had antibodies to the Ebola
virus.
Three other studies published in 1986
documented Ebola antibody prevalence rates of
10.6, 13.4 and 14 percent, respectively, in
northwestern Liberia, not far from its borders
with Sierra Leone and Guinea. These articles,
along with other forgotten reports from the
1980s on antibody prevalence in neighboring
Sierra Leone and Guinea, suggest the possibility
of what some call “sanctuary sites,” or
persistent, if latent, Ebola infection in humans.
Bernice Dahn is the chief medical officer of Liberia’s Ministry of Health, where Vera Mussah
is the director of county health services. Cameron Nutt is the Ebola response adviser to Dr.
Paul Farmer at the nonprofit group Partners in Health.

“Free” and “Open”
• “Free software is a matter of liberty, not price.
‘Free speech’, not ‘free beer’.” (R M Stallman)
• “A piece of data or content is open if anyone is
free to use, reuse, and redistribute it”
(OKFN) http://opendefinition.org/
• “open” (access) has multiple incompatible “definitions”. Major split
is “human eyeballs” vs copying and machine “reusability”
• “Open” is a marketing term for publishers, who frequently (often
deliberately) do not grant full Openness.
“Gratis” vs “Libre”

http://www.budapestopenaccessinitiative.org/read
… an unprecedented public good. …
… completely free and unrestricted access to [peer-
reviewed literature] by all scientists, scholars, teachers,
students, and other curious minds. …
…Removing access barriers to this literature will
accelerate research, enrich education, share the
learning of the rich with the poor and the poor with
the rich, make this literature as useful as it can be, and
lay the foundation for uniting humanity in a common
intellectual conversation and quest for knowledge.
(Budapest Open Access Initiative, 2002)

Scientific and Medical publication (STM)[+]
• World Citizens pay $400,000,000,000…
• … for research in 1,500,000 articles …
• … cost $300,000 each to create …
• … $7000 each to “publish” [*]…
• … $10,000,000,000 from academic libraries …
• … to “publishers” who forbid access to 99.9% of citizens of
the world …
• 85% of medical research is wasted (not published, badly
conceived, duplicated, …)
[+] Figures probably ±50%
[*] The arXiv preprint server costs $7 USD per paper
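
The figures above can be cross-checked with a few lines of arithmetic. This is only a back-of-envelope sketch: every input is the slide's own estimate (stated as probably accurate to about 50%), and the variable names are illustrative.

```python
# Back-of-envelope check of the slide's figures. All inputs are the
# slide's own rough estimates; nothing here is measured data.
TOTAL_RESEARCH_SPEND_USD = 400_000_000_000  # paid by world citizens
ARTICLES_PER_YEAR = 1_500_000
PUBLISH_COST_PER_ARTICLE_USD = 7_000
ARXIV_COST_PER_PAPER_USD = 7


def cost_per_article(total_usd: int, articles: int) -> float:
    """Average research spend behind each published article."""
    return total_usd / articles


creation_cost = cost_per_article(TOTAL_RESEARCH_SPEND_USD, ARTICLES_PER_YEAR)
total_publishing_bill = PUBLISH_COST_PER_ARTICLE_USD * ARTICLES_PER_YEAR
publish_vs_arxiv_ratio = PUBLISH_COST_PER_ARTICLE_USD / ARXIV_COST_PER_PAPER_USD

print(f"research cost per article: ${creation_cost:,.0f}")
print(f"implied publishing bill:   ${total_publishing_bill:,}")
print(f"publishing vs arXiv:       {publish_vs_arxiv_ratio:,.0f}x")
```

The per-article research cost comes out somewhat below the slide's rounded $300,000, and the implied publishing bill is close to the $10,000,000,000 quoted for academic libraries, so the numbers are mutually consistent within the stated error.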

• “creative use of these large data sets in the US health care sector
could generate more than $300bn in value per annum” [MGI,
McKinsey]
• Gartner Inc. has identified 'Big Data' and 'Next-Generation
Analytics' as two of the 'Top 10 Strategic Technologies' for 2012.
• Given the volume of text generated by business, academic and
social activities – in for example competitor reports, research
publications or customer opinions on social networking sites – text
mining is, however, highly important. [JISC]
• there are some tasks that simply could not be achieved without
using text mining. For example, a major pharmaceutical company
used text mining tools to evaluate 50,000 patents in 18 months.
This would have taken 50 person years to achieve manually,
meaning that it would not even have been contemplated. [JISC]
“Big Data” and Analytics (ContentMining)

Prof. Ian Hargreaves (2011): "David Cameron's
exam question”: "Could it be true that laws
designed more than three centuries ago with the
express purpose of creating economic incentives
for innovation by protecting creators' rights are
today obstructing innovation and economic
growth?”
“yes. We have found that the UK's intellectual
property framework, especially with regard to
copyright, is falling behind what is needed.” "Digital
Opportunity" by Prof Ian Hargreaves - http://www.ipo.gov.uk/ipreview.htm. Licensed under CC BY 3.0 via Wikipedia -
https://en.wikipedia.org/wiki/File:Digital_Opportunity.jpg#/media/File:Digital_Opportunity.jpg

PUBLISHER TDM LICENCE INITIATIVES
GENERALLY DO NOT HELP
• Publishers have started offering their own TDM licences and policies
• Their licences often impose unfair (and in the case of the UK, unenforceable)
constraints on researchers’ freedom to exploit TDM, e.g., requiring users to
employ publisher’s API, putting unnecessary restrictions on how much can be
copied, or how fast it can be copied.
• Why “unenforceable”? Because, as noted earlier, UK law specifically states
that any contract or licence term that prevents anyone from doing TDM in the
manner prescribed in the new exception shall be deemed null and void.
• Really need a test case on these attempted restrictions.
• Springer and Royal Society offer generous TDM provisions.
• So why are so many publishers offering restrictive licences in the UK? Maybe
they hope licensees are ignorant of the strength of the new law, or the
publishers in fact don’t know about it. So they are either deliberately
misleading, or ignorant
Prof Charles Oppenheim and contentmine.org

Elsevier wants to control Open Data
[asked by Michelle Brook]

Front. Pharmacol., 03 October 2011 |
http://dx.doi.org/10.3389/fphar.2011.00051

How “data” are published in the 21st C

http://drugmonkey.scientopia.org/2010/08/11/yay-j-neuroscience-agrees-with-me-that-supplementary-materials-is-bs-and-ruining-science/
w00000t!!!!1111!!!!ELEVEN!!!!
YAYAYAYAYAYAY!!!! Damn
tootin'!!!!!
Supplemental material also
undermines the concept of a
self-contained research report
by providing a place for critical
material to get lost. Methods
that are essential for replicating
the experiments, analyses that
are central to validating the
results, and awkward
observations are increasingly
being relegated to supplemental
material. Such material is not
supplemental and belongs in the
body of the article, but authors
can be tempted (or, with some
journals, encouraged) to place
essential article components in
the supplemental material.

[Diagram: the ContentMine pipeline. Sources (EuPMC, arXiv, CORE, HAL, university repositories, ToC services, and publisher sites such as PLoS ONE, BMC, PeerJ … Nature, IEEE, Elsevier …) are reached by query (getpapers) or daily crawl (quickscrape), giving URLs and DOIs. norma (normalizer, structurer, semantic tagger) turns PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV and XLS into Semantic ScholarlyHTML: text, data, figures, abstract, methods, references, captioned figures, HTML tables. ami content-mining plugins (Chem, Phylo, Trials, Crystal, Plants) extract Facts, which go to the catalogue for search, lookup, visualization and analysis. The community contributes plugins, scrapers, queries and taggers. Throughput: 30,000 pages/day.]

Regular Expressions for Systematic Reviews of Animal Tests
[Figure: table of regex hits with columns for preceding text, extracted term and following text]
Today’s Results!! We searched papers for 200 regex-based
Terms and got ca 100 hits per paper
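
A minimal sketch of this kind of search, returning each extracted term with its preceding and following text. The three patterns are hypothetical examples chosen for illustration; they are not the 200 terms used for the results above, and this is not the AMI regex plugin itself.

```python
import re


def find_with_context(text, pattern, width=30):
    """Yield (preceding, extracted, following) for every regex hit in text."""
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        start, end = m.span()
        yield (text[max(0, start - width):start], m.group(0), text[end:end + width])


# Hypothetical terms, for illustration only.
TERMS = {
    "randomisation": r"\brandomi[sz](?:ed|ation)\b",
    "blinding": r"\bblind(?:ed|ing)\b",
    "species": r"\b(?:mice|mouse|rats?|rabbits?)\b",
}

sample = "Male rats were randomized to treatment; assessors were blinded to group."
hits = {name: list(find_with_context(sample, rx)) for name, rx in TERMS.items()}

for name, found in hits.items():
    for before, term, after in found:
        print(f"{name:14} | {before!r} | {term!r} | {after!r}")
```

Keeping the surrounding text with each hit lets a human reviewer judge quickly whether a match is a real mention or a false positive.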

Questions we can tackle
• How do we find (mentions of) clinical/animal trials?
• Is a document a trial?
• What is the subject of the trial?
• What is the methodology used?
• Does the design and practice conform to
CONSORT/ARRIVE?
• What are the outcomes?
• Can we extract specific re-usable information?
• Who are involved? (researchers, sponsors, patients?)
• Has a proposed trial been completed and reported?

Linked Open Data – the world’s knowledge
very little physical science and THESES?? 
http://upload.wikimedia.org/wikipedia/commons/3/34/LOD_Cloud_Diagram_as_of_September_2011.png
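[Figure: Linked Open Data cloud diagram (September 2011), with regions labelled DBPedia, Bio, Comp, Lib, PDB, Ontologies, Gov, GOV.uk, Music, Art, Literature, Social and Knowledge bases; the content is held as RDF triples]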

The Right to Read is the Right to Mine
http://contentmine.org

https://en.wikipedia.org/wiki/Irrigation#mediaviewer/File:Pump-enabled_Riverside_Irrigation_in_Comilla,_Bangladesh,_25_April_2014.jpg CC BY-SA 3.0
Daily Stream of 100,000 Open Facts
Twitter? Indexed by CAT

What is “Content”?
http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY
SECTIONS, MAPS, TABLES, CHEMISTRY, TEXT, MATH
contentmine.org tackles these

[Diagram: Daily Crawl of 2000-5000 articles from open (PLoS ONE, BMC 1, BMC 2), closed (Closed1, Closed2) and hybrid sources. Crawl … Scrape … Normalize … Mine feeds the CATalog, yielding enhanced annotated articles, FACTS, Linked Open Data and semantic scientific objects.]

Machine-Human symbioses
• Wikipedia
• OpenStreetMap
• Google
We aim to make it trivial for a human+machine
to mine the scientific literature.
By building Communities

ContentMine Workshops and
Hackdays
Open Science Brazil, 2014-08
Easily distributed software
Get started in 30 mins
Build application
in a morning
Start simple: bagOfWords, Stemming, Regex, templates

Facts Marked by “non-scientists” in ContentMine workshops
With Wikipedia everyone can be a scientist

Oxford 2013
Berlin 2014
Delhi 2014
Jenny Molloy with mascot AMI

Workshops
(1-hour -> full day or more)
2014-May->Nov
• Budapest/Shuttleworth
• Leicester Univ
• Electronic Theses and Dissertations
• Austrian Science Fund AT
• OKFest DE
• Eur. Bioinformatics Institute
• Open Science Rio de Janeiro BR
• Sci DataCon , Delhi IN
• Univ of Chicago US
• OpenCon 2014, Wash DC. US
• JISC , London
Upcoming
• LIBER
• Cochrane
• BL
• Wellcome Trust (April)
• WHO
Collaborators
• Wikimedia/Wikidata
• Mozilla
• Open Knowledge
• LIBER (European Research Libraries)
• British Library
• Wellcome Trust
• EBI (Eur. Bioinf. Inst.)
• JISC
• Open Access Button
• SPARC
• Creative Commons
• CORE
• EuropePubmedCentral

• CRAWL the web for scientific documents
(articles, grey literature, repositories)
• quickSCRAPE pages (text, graphics, images, data)
• NORMA-lize page to semantic form
…Open semantic science …
• MINE pages with your methods and tools (AMI)
• CAT-alogue results in searchable index
• Automate daily process (CANARY)
contentmine.org Infrastructure
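
The stages above compose into a simple chain. The sketch below is a toy illustration of that shape only: every function is a hypothetical stand-in written for this example (it does not call quickscrape, norma, ami or CAT), and the pages are held in memory rather than crawled.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text content of an HTML page."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def crawl(sources):
    """Stand-in for CRAWL: yield (url, raw_html) pairs from an in-memory dict."""
    yield from sources.items()


def scrape(raw_html):
    """Stand-in for quickSCRAPE: pull the text out of the page."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return "".join(parser.chunks)


def normalize(text):
    """Stand-in for NORMA: collapse runs of whitespace."""
    return " ".join(text.split())


def mine(text, pattern):
    """Stand-in for MINE: return every regex hit as a 'fact'."""
    return re.findall(pattern, text)


def catalogue(index, url, facts):
    """Stand-in for CAT: map each fact to the pages it was found on."""
    for fact in facts:
        index.setdefault(fact, []).append(url)


def run_pipeline(sources, pattern):
    index = {}
    for url, raw in crawl(sources):
        catalogue(index, url, mine(normalize(scrape(raw)), pattern))
    return index


pages = {
    "http://example.org/a": "<p>Ebola   antibodies were found.</p>",
    "http://example.org/b": "<p>No virus here.</p><p>Ebola virus again.</p>",
}
index = run_pipeline(pages, r"Ebola(?: virus)?")
print(index)
```

Because each stage takes the previous stage's output and nothing else, any one of them can be swapped for a better tool without touching the rest, which is what makes a daily automated run practical.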

[Diagram: the CANARY pipeline. Scientific literature and repositories are reached by crawl or feed (URL, DOI). quickscrape, with bespoke per-journal scrapers (XPath), fetches PDF, DOC, CSV, TXT, XML and bad HTML. Norma indexes and transforms these (including OCR and diagrams) into sHTML. AMI plugins (regex, sequences, species, chemistry, phylogenetics, farming) and per-journal metadata taggers produce open NORMA-lized scientific literature + facts, held in the CAT-alogue index.]

https://commons.wikimedia.org/wiki/File:Flickr_-_DVIDSHUB_-_RSP_Warrior_Challenge_Prepares_Soldiers_Mentally,_Physically_%281%29.jpg
CRAWLing the Literature
NO Central Table of Contents
Massive technical, political, legal opposition
Little interest from Academia
Tedious
Few general tools

The Right to Read is The Right To Mine
PMR in 2012: http://blog.okfn.org/2012/06/01/the-right-to-read-is-the-right-to-mine/

SCRAPE
https://en.wikipedia.org/wiki/Gleaning#mediaviewer/File:Millet_Gleaners.jpg PublicDomain
quickscrape*: PDF, HTML, XML
*Scrapers created by Richard Smith-Unna + Community
File types: HTML, PDF, XML, PNG, SVG, CSV, DOC, LaTeX, CIF, …
Non-standard per-publisher site

https://en.wikipedia.org/wiki/W._Heath_Robinson#mediaviewer/File:Robinson%28WH%29-%28%27Uncle_Lubin%27%29.jpg PublicDomain
NORMA-lization of Scientific Literature
Input: PDFs, broken HTML, PNGs for math, etc.
NORMA
Output: Unicode, diacritics, well-formed, sectioned, tagged, SVG diagrams
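
Three of these normalization steps (Unicode, diacritics, well-formedness) can be sketched with the Python standard library. This is an illustration of the idea only, not NORMA's implementation; the function names are made up for the example.

```python
import unicodedata
import xml.etree.ElementTree as ET


def to_nfc(text):
    """Compose base letters and combining marks into single code points."""
    return unicodedata.normalize("NFC", text)


def strip_diacritics(text):
    """Remove combining marks, e.g. for diacritic-insensitive search."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def expand_ligatures(text):
    """Replace compatibility characters such as PDF ligatures (NFKC)."""
    return unicodedata.normalize("NFKC", text)


def is_well_formed(xml_text):
    """True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(to_nfc("Graz\u030culis"))             # z + combining caron, composed
print(strip_diacritics("Gra\u017eulis"))    # caron removed
print(expand_ligatures("e\ufb03cient"))     # 'ffi' ligature expanded
print(is_well_formed("<p>ok</p>"), is_well_formed("<p>broken<br></p>"))
```

Ligatures and decomposed accents are typical PDF extraction artefacts, and an unclosed `<br>` is typical of publisher HTML; both defeat naive text search unless they are repaired first.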

AMI-plugins
• BagOfWords, Stemming and Regular Expressions
• Species
• Biological Sequences
• Chemical compounds & reactions
• Farming * (Rory Aaronson)
• Crystallography * (Saulius Grazulis, COD)
• Clinical Trials * (Amy Price)
• Phylogenetics * (Ross Mounce)
• Phytochemistry * (Chris Steinbeck, PMR)
* subcommunities

Text-based plugins
• Bag of words
(https://en.wikipedia.org/wiki/Bag-of-words_model)
• Tf–idf (term frequency, inverse document frequency)
(https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
• Templates and regexes (regular expressions).
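
A minimal sketch of the first two ideas in plain Python. The tokenizer (lower-case runs of a-z) and the unsmoothed tf–idf weighting are simplifying assumptions for illustration; the AMI plugins may differ.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count each lower-cased word, ignoring order and punctuation."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def tf_idf(docs):
    """Return one {word: score} dict per document.

    tf  = count of the word / total words in the document
    idf = log(number of documents / documents containing the word)
    """
    bags = [bag_of_words(d) for d in docs]
    n_docs = len(bags)
    doc_freq = Counter(word for bag in bags for word in bag)
    scores = []
    for bag in bags:
        total = sum(bag.values())
        scores.append(
            {w: (c / total) * math.log(n_docs / doc_freq[w]) for w, c in bag.items()}
        )
    return scores


docs = [
    "the trial was randomized",
    "the trial was blinded",
    "the mice were treated",
]
scores = tf_idf(docs)
for doc, score in zip(docs, scores):
    best = max(score, key=score.get)
    print(f"{doc!r}: most distinctive word = {best!r}")
```

Words that occur in every document (here "the") score zero, so the distinctive terms of each paper rise to the top without any hand-built stop list.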

“Bag of Words”
Three fulltext articles from trialsjournal.com

Regular Expressions for Systematic Reviews of Animal Tests
[Figure: table of regex hits with columns for preceding text, extracted term and following text]

“nuggets” in a scientific paper
Highlighted: quantity, units, value ranges, chemical, project, places
Humans aren’t designed to mine this …
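
A machine, by contrast, can pull quantities, units and value ranges out of running text with one pattern. The sketch below is illustrative only: the unit list is a small hypothetical sample, and real papers need far more units and formats than this handles.

```python
import re

# Number, optional range ("10-15", "10 to 15"), then a unit.
# The unit list is a small illustrative sample, not a complete vocabulary.
QUANTITY = re.compile(
    r"(?P<low>\d+(?:\.\d+)?)"
    r"(?:\s*(?:-|–|to)\s*(?P<high>\d+(?:\.\d+)?))?"
    r"\s*(?P<unit>mg|g|kg|mL|L|°C|K|h|min|mmol|mol)\b"
)


def extract_quantities(text):
    """Return (low, high_or_None, unit) for each quantity found in text."""
    found = []
    for m in QUANTITY.finditer(text):
        high = float(m.group("high")) if m.group("high") else None
        found.append((float(m.group("low")), high, m.group("unit")))
    return found


sample = (
    "The mixture was stirred at 25 °C for 2 h, then 10-15 mL of ethanol "
    "and 1.5 g of catalyst were added."
)
results = extract_quantities(sample)
for low, high, unit in results:
    print(low, high, unit)
```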

http://chemicaltagger.ch.cam.ac.uk/
Typical chemical synthesis

Open Content Mining of FACTs
Machines can interpret chemical reactions
We have done 500,000 patents. There are >
3,000,000 reactions/year. Added value > 1B Eur.

[Plot: Ln bacterial load per fly (y-axis, ticks from 6.0 to 11.5) against days post-infection (x-axis, 0 to 5)]
Bitmap Image and Tesseract OCR

Extracted: UNITS, TICKS, QUANTITY, SCALE, TITLES
DATA!! 2000+ points
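
Once the tick labels have been read, turning pixel positions into data values is a linear calibration per axis. A minimal sketch, assuming linear axes; the pixel coordinates are invented for the example, while the tick values (6.0 and 11.5, 0 and 5) are the ones on the plot.

```python
def make_axis_mapper(pixel_a, value_a, pixel_b, value_b):
    """Return a function mapping a pixel coordinate to a data value.

    Calibrated from two ticks on the same (linear) axis.
    """
    scale = (value_b - value_a) / (pixel_b - pixel_a)

    def to_value(pixel):
        return value_a + (pixel - pixel_a) * scale

    return to_value


# Hypothetical pixel positions of two ticks per axis.
x_of = make_axis_mapper(50, 0.0, 550, 5.0)     # days post-infection
y_of = make_axis_mapper(400, 6.0, 70, 11.5)    # image y runs downward

# A plotted point found at pixel (250, 235):
point = (x_of(250), y_of(235))
print(point)
```

The same two-tick calibration works whichever way the image axis runs, because the sign of the scale factor takes care of the direction.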

Dumb PDF → CSV → Semantic Spectrum (automatic extraction)
Processing: smoothing, Gaussian filter, 2nd derivative
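
The smoothing and second-derivative steps can be sketched in plain Python. The kernel width, the edge handling (clamping) and the three-point difference formula are assumptions made for this illustration, not the settings used in AMI.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(values, sigma=1.0, radius=3):
    """Gaussian-filter a list of y values, clamping indices at the edges."""
    kernel = gaussian_kernel(sigma, radius)
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * values[j]
        out.append(acc)
    return out


def second_derivative(values):
    """Three-point finite difference for unit spacing (interior points only)."""
    return [values[i - 1] - 2 * values[i] + values[i + 1] for i in range(1, len(values) - 1)]


# Toy spectrum: flat baseline with one peak at index 5.
spectrum = [0, 0, 0, 1, 4, 9, 4, 1, 0, 0, 0]
curvature = second_derivative(smooth(spectrum))
peak_index = curvature.index(min(curvature)) + 1  # +1: interior offset
print(peak_index)
```

Smoothing first matters because differentiation amplifies noise; the most negative curvature then marks the peak position.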

AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home
Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:
AMI reads the complete diagram,
recognizes the paths and
generates the molecules. Then
she creates a stop-frame animation
showing how the 12 reactions
lead into each other

https://blogs.ch.cam.ac.uk/pmr/2014/06/25/content-mining-we-can-now-mine-images-of-phylogenetic-trees-and-more/ for story of extraction
Thinning → Topology → Serialization → Newick
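
The last step, serializing a recovered tree topology to Newick, is a short recursion. A minimal sketch: the nested `(name, children)` tuple is a hypothetical representation chosen for the example, and branch lengths are omitted.

```python
def to_newick(node):
    """Serialize a (name, children) tree to Newick, without the final ';'."""
    name, children = node
    if not children:
        return name
    return "(" + ",".join(to_newick(child) for child in children) + ")" + name


# Toy topology: A is sister to the clade (B, C); internal nodes are unnamed.
tree = ("", [("A", []), ("", [("B", []), ("C", [])])])
newick = to_newick(tree) + ";"
print(newick)
```

Once a tree image has been reduced to a string like this, it can be loaded by standard phylogenetics software and compared with other trees.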

[Figure: sub-network from CATalog linking Peter Murray-Rust, the BMC publisher and the Blue Obelisk paper (20 co-authors)]

Phytochemistry extraction
O. dayi
“volatile composition of”
A. sibeiri
A. judaica
Displayed by CAT (CottageLabs)

What we can do
• Recognize and promote autonomous sub-
communities
• Engage Early Career Researchers, including
undergraduates and let THEM BUILD the
systems.
• COMMUNALLY build tools for data checking
• Insist on semantic data input, even if it costs
submissions
