Digital Scholarship: Enlightenment or Devastated Landscape?

Digital Scholarship: Enlightenment or
Devastated Landscape?
Peter Murray-Rust,
University of Cambridge
IT Future Conference, Informatics Forum, Edinburgh, UK 2015-12-17
(Glen Feshie, remains of forest, CC-BY-SA 2.0 Ian Shiell http://www.geograph.org/uk/photo/3944612.jpg )

University of Stirling 1972
student occupations and sit-ins
University of Stirling
Used without permission but with thanks and Love
Liverpool , Warwick, Emmanuel Coll Camb., UCL, Glasgow, Middlesex, …
Peter Murray-Rust,
Lecturer

Output of scholarly publishing
[2] https://en.wikipedia.org/wiki/Mont_Blanc#/media/File:Mont_Blanc_depuis_Valmorel.jpg
586,364 Crossref DOIs 201507 [1] per month
>2.5 million (papers + supplemental data) /year*
 4500 m high per year [2]
 Representing ? 500 Billion USD public funding
[1] http://www.crossref.org/01company/crossref_indicators.html

Refs: Erriquez_Daniela_tesi, Fiorentina_Elena_tesi, Gou_Qian_Tesi, mbarontini_tesid, terracciano_maria_tesi
BagOfWords for Italian Theses

http://chemicaltagger.ch.cam.ac.uk/
• Typical
Typical chemical synthesis

Open Content Mining of FACTs
Machines can interpret chemical reactions
We have done 500,000 patents. There are >
3,000,000 reactions/year. Added value > 1B Eur.

What is “Content”?
http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.01113
03&representation=PDF CC-BY
SECTIONS
MAPS
TABLES
CHEMISTRY
TEXT
MATH
contentmine.org tackles these

https://en.wikipedia.org/wiki/Tree_of_life CC BY-SA

“Root”
4500 papers each
with 1 tree

OCR (Tesseract)
Norma (imageanalysis)
(((((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301):131,Thermotoga
_maritime:357):12,(Mycobacterium_tuberculosis:223,Bifidobacterium_longum:333):158):10,((Optiutus_te
rrae:441,(((Borrelia_burgdorferi:…202):91):22):32,(Proprinogenum_modestus:124,Fusobacterium_nucleat
um:167):217):11):9);
Semantic re-usable/computable output (ca 4 secs/image)

Supertree for 924 species
Tree

Supertree created from 4300 papers

Systematic reviews of the
Neuroscience literature:
• 30,000 papers in 1 year
• Extraction of data from graphs
Malcolm Macleod, Professor of Neurology and
Translational Neuroscience at the Centre for
Clinical Brain Sciences, University of Edinburgh,
with ContentMine 2015

UNITS
TICKS
QUANTITY
SCALE
TITLES
DATA!!
2000+ points

Dumb PDF
CSV
Semantic
Spectrum
2nd Derivative
Smoothing
Gaussian Filter
Automatic
extraction

Polly has 20 seconds to read this paper…
…and 10,000 more

ContentMine software can cut the effort by 50%
Polly: “there were 10,000 abstracts and due
to time pressures, we split this between 6
researchers. It took about 2-3 days of work
(working only on this) to get through
~1,600 papers each. So, at a minimum this
equates to 12 days of full-time work (and
would normally be done over several weeks
under normal time pressures).”

ContentMine Tools*
http://iucn.contentmine.org (endangered species)
http://fotd.contentmine.org (fact of the day)
http://bubbles.contentmine.org (network analysis of
papers)
*Dr. Mark MacGillivray, Informatics Forum, University of Edinburgh

Fact of the Day
• http://fotd.contentmine.co/?s=daily20151209
(images from https://en.wikipedia.org/wiki/Caenorhabditis_elegans CC-BY-SA)

Facts in context
daily IUCN endangered species news
en.wikipedia.org CC By-SA

http://www.budapestopenaccessinitiative.org/read
… an unprecedented public good. …
… completely free and unrestricted access to [digital
scholarship] by all scientists, scholars, teachers,
students, and other curious minds. …
…share the learning of the rich with the poor and the
poor with the rich, … and lay the foundation for
uniting humanity in a common intellectual
conversation and quest for knowledge.
(Budapest Open Access Initiative, 2003)

DNADigest + ContentMine looking for DNA datasets in the literature
European Bioinformatics Institute, 2015-12-11

C) What’s the problem with this spectrum?
Org. Lett., 2011, 13 (15), pp 4084–4087
Original thanks to ChemBark

After AMI2 processing…..
… AMI2 has detected a square

Chris Hartgerink, University of Tilburg
I am a statistician interested in detecting potentially
problematic research such as data fabrication, which
results in unreliable findings and can harm policy-making,
confound funding decisions, and hampers research
progress.
…I am content mining results reported in the psychology
literature

I am a statistician interested in detecting potentially problematic research such as data fabrication,
which results in unreliable findings and can harm policy-making, confound funding decisions, and
hampers research progress.
To this end, I am content mining results reported in the psychology literature. Content mining the
literature is a valuable avenue of investigating research questions with innovative methods. For
example, our research group has written an automated program to mine research papers for errors in
the reported results and found that 1/8 papers (of 30,000) contains at least one result that could
directly influence the substantive conclusion [1].
In new research, I am trying to extract test results, figures, tables, and other information reported in
papers throughout the majority of the psychology literature. As such, I need the research papers
published in psychology that I can mine for these data. To this end, I started ‘bulk’ downloading research
papers from, for instance, Sciencedirect. I was doing this for scholarly purposes and took into account
potential server load by limiting the amount of papers I downloaded per minute to 9. I had no intention
to redistribute the downloaded materials, had legal access to them because my university pays a
subscription, and I only wanted to extract facts from these papers.
Full disclosure, I downloaded approximately 30GB of data from Sciencedirect in approximately 10 days.
This boils down to a server load of 0.0021GB/[min], 0.125GB/h, 3GB/day.
Approximately two weeks after I started downloading psychology research papers, Elsevier notified
my university that this was a violation of the access contract, that this could be considered stealing of
content, and that they wanted it to stop. My librarian explicitly instructed me to stop downloading
(which I did immediately), otherwise Elsevier would cut all access to Sciencedirect for my university.
I am now not able to mine a substantial part of the literature, and because of this Elsevier is directly
hampering me in my research.
[1] Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2015). The
prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 1–22.
doi: 10.3758/s13428-015-0664-2
Chris Hartgerink’s blog post
“Elsevier stopped me doing my research”

The Right to Read
is
The Right to Roam
The Right to Mine
Kinder Mass Trespass
used without permission but with love and thanks

The Right to Read is the Right to Mine**PeterMurray-Rust, 2011
http://contentmine.org

2014 UK “Hargreaves” reform

Proposed
amendment after
publisher lobbying
Julia Reda’s report

STM Publishers Licence
2012_03_15_Sample_Licence_Text_Data_Mining.pdf
(Summary: we have NO rights)
• [cannot publish to: ] “libraries, repositories, or archives”
• [cannot] “Make the results of any TDM Output available on an externally facing server or
website”
• “Subscriber shall pay a […] fee”
Heather Piwowar: “negotiating with publishers [made me physically ill]”
WE WALKED OUT
• Brit Library
• JISC
• RLUK
• OKFN
• …
Licences destroy Content Mining

Julia Reda MEP
Julia Reda MEP
The current copyright regime is undermining our ability
to produce evidence. It is time that academics in large
numbers … speak up about this issue. Decreasing the very
substantial burdens and transaction costs for research and
education is one of the declared goals of the Commission’s
copyright reform proposal, and the European Parliament has
echoed that sentiment in my report.
Prof Ian Hargreaves:
…make sure that the voices of the digital many
are not drowned out in policy discussions by
the digitally self-interested few.
http://www.create.ac.uk/blog/2015/09/16/epip2015-opening-keynote-response-
transcript/
there’s a serious risk
of Europe digging
itself deeper into a
digital black hole on
copyright,

http://www.nytimes.com/2015/04/08/opinion/yes-we-were-warned-about-
ebola.html
We were stunned recently when we stumbled across an article by European
researchers in Annals of Virology [1982]*: “The results seem to indicate that
Liberia has to be included in the Ebola virus endemic zone.” In the future,
the authors asserted, “medical personnel in Liberian health centers should be
aware of the possibility that they may come across active cases and thus be
prepared to avoid nosocomial epidemics,” referring to hospital-acquired
infection.
*Still behind a 35USD paywall
Bernice Dahn (chief medical officer of Liberia’s Ministry of Health)
Vera Mussah (director of county health services)
Cameron Nutt (Ebola response adviser to Partners in Health)
A System Failure of Scholarly Publishing

[1] The Military-Industrial-Academic complex (1961)
(Dwight D Eisenhower, US President)
Publishers Academia
Glory+?
$$, MS
review
Taxpayer
Student
Researcher
$$ $$
in-kind
The Publisher-Academic complex[1]

Panton Principles for Open Scientific Data
Jenny Molloy
Ross Mounce
Sam Moore Peter Kraker Rosie GraySophie Kay
PANTON ARMS
Panton Fellows
CC02010
http://pantonprinciples.org/about/

Elsevier wants to control Open Data
[asked by Michelle Brook]

Scholarly infrastructure becomes closed
No accountability for monitoring and control

Thanks to some Children
of the Digital Enlightenment
• David Carroll & Joe McArthur: OAButton
• Rayna Stamboliyska & Pierre-Carl Langlais
• Jon Tennant
• Ross Mounce
• Jenny Molloy
• Erin McKiernan
• Jack Andraka
• Michelle Brook
• Heather Piwowar
• TheContentMine Team
• Mark MacGillivray
• Rufus Pollock
• Jonathan Gray
• Sophie Kay
• Aaron Swartz
• Chris Hartgerink
Jean-Claude Bradley [1] a chemist
developed Open notebook science;
making the entire primary record of a
research project publicly available
online as it is recorded. (WP)
J-C promoted these ideas with
UNDERGRADUATE scientists.
[1] Unfortunately J-C died in 2014;
we held a memorial meeting in
Cambridge
Sophie
Kay

Discussion
• Let’s concentrate on what we can do to create
positive change, rather than explain why we
can’t do anything.*
• [1] “It’s not our fault, it’s (a) librarians (b) researchers (c) publishers (d) funders (e)
governments (f) scholarly societies (g) principals/Vice-chancellors … “

Digital Scholarship: Enlightenment or Devastated Landscape?

Digital Scholarship: Enlightenment or Devastated Landscape?

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Digital Scholarship: Enlightenment or Devastated Landscape?

Similar to Digital Scholarship: Enlightenment or Devastated Landscape? (20)

More from TheContentMine

More from TheContentMine (16)

Recently uploaded

Recently uploaded (20)

Digital Scholarship: Enlightenment or Devastated Landscape?

Editor's Notes