Open Access to Scientific Knowledge

MRC Cognition and Brain Sciences Unit, Cambridge, UK, 2018-11-20
Open Scientific Knowledge
Peter Murray-Rust
The ContentMine and Dept of Chemistry, Univ of Cambridge
A new knowledgebase beyond journals
Images from ContentMine CC BY and Wikimedia CC BY-SA
pm286@cam.ac.uk
peter@contentmine.org

Tux and GNU: Open and Free Heroes
This is a story of liberation. You can be part of it.
And it will make life easier for you and citizens everywhere.
TUX, Linux, GNU, FSF: might be controversial

Our story is in 3 ACTS
• We’ll show you WHY we need OPEN
• Then a DEMO of SOFTWARE: getpapers and AMI
• Building COMMUNITY: we need YOU/US
Structure of the presentation
Rapidly reading the literature and supporting systematic reviews

Sustainable Open?
We need volunteers and a sustainable organization
Cannot be bought commercially, 501(c)3, OpenLock
SSI, NumFOCUS

Rik Smith-Unna
PlantSciences Cambridge
ContentMine
CoKo
WorldBrain
getpapers and quickscrape

(2x digital music industry!)
ContentMine is OpenLocked Non-Profit http://contentmine.org
The Right to Read is the Right to Mine

The problem: publishers control the infrastructure
Sucking money out of the system
And destroying science in the Global South…
Completely unregulated industry
Megapub451*
*In Fahrenheit 451 firemen burned books; in the 21st century publishers restrict knowledge

cc by-nc-sa license LabHack and Alliance Earth
1 APC = 1900 USD
1 bioreactor = 25 USD
1 Raspberry Pi = 55 USD
1 submission to bioRxiv = free (10 USD hidden)
“a PCR machine in the UK is around £6000 but in Zimbabwe about $33000 - try convincing someone to pay APCs when they have to try and save for that.”
CITIZENS!
Zimbabwe. LabHack team from Harare Institute of Technology.

Scientific knowledge should be totally free
OKFN
GNU
TUX/Linux
ContentMine
What’s “surveillance capitalism”?
No surveillance capitalism
Innovative reuse of content. No ©

@Senficon (Julia Reda): Text & Data mining in times of #copyright maximalism:
"Elsevier stopped me doing my research"
http://onsnetwork.org/chartgerink/2015/11/16/elsevier-stopped-me-doing-my-research/ … #opencon #TDM
Elsevier stopped me doing my research
Chris Hartgerink

I am a statistician interested in detecting potentially problematic research such as data fabrication,
which results in unreliable findings and can harm policy-making, confound funding decisions, and
hampers research progress.
To this end, I am content mining results reported in the psychology literature. Content mining the
literature is a valuable avenue of investigating research questions with innovative methods. For
example, our research group has written an automated program to mine research papers for errors in
the reported results and found that 1/8 papers (of 30,000) contains at least one result that could
directly influence the substantive conclusion [1].
In new research, I am trying to extract test results, figures, tables, and other information reported in
papers throughout the majority of the psychology literature. As such, I need the research papers
published in psychology that I can mine for these data. To this end, I started ‘bulk’ downloading research
papers from, for instance, Sciencedirect. I was doing this for scholarly purposes and took into account
potential server load by limiting the amount of papers I downloaded per minute to 9. I had no intention
to redistribute the downloaded materials, had legal access to them because my university pays a
subscription, and I only wanted to extract facts from these papers.
Full disclosure, I downloaded approximately 30GB of data from Sciencedirect in approximately 10 days.
This boils down to a server load of 0.0021GB/[min], 0.125GB/h, 3GB/day.
Approximately two weeks after I started downloading psychology research papers, Elsevier notified my
university that this was a violation of the access contract, that this could be considered stealing of
content, and that they wanted it to stop. My librarian explicitly instructed me to stop downloading
(which I did immediately), otherwise Elsevier would cut all access to Sciencedirect for my university.
I am now not able to mine a substantial part of the literature, and because of this Elsevier is directly
hampering me in my research.
[1] Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2015). The
prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 1–22.
doi: 10.3758/s13428-015-0664-2
Chris Hartgerink’s blog post
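
The server-load figures quoted in the blog post can be checked with a few lines of arithmetic. This is only a sketch: the two inputs (about 30 GB over about 10 days) are the approximate values stated in the post, and the function name is made up for illustration.

```python
# Approximate figures quoted in the blog post.
TOTAL_GB = 30.0
DAYS = 10.0


def download_rates(total_gb: float, days: float) -> dict:
    """Convert a total download volume into average per-day, per-hour and per-minute rates."""
    per_day = total_gb / days
    per_hour = per_day / 24
    per_minute = per_hour / 60
    return {"GB/day": per_day, "GB/h": per_hour, "GB/min": per_minute}


rates = download_rates(TOTAL_GB, DAYS)
for unit, value in rates.items():
    print(f"{value:.4f} {unit}")
```

The averages agree with the post: 3 GB/day, 0.125 GB/h and about 0.0021 GB/min.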

It costs 10 USD to mount an article on (bio)arXiv…
So why 2000 USD for a megapub451 article?

I can charge whatever I like!! No regulator!
academics pay – it’s not their money – and they get glory

APCs and Journals MUST GO!
arXiv, bioRxiv, chemRxiv: 10$
Commercial publisher: 1800$
Review
Production
Hosting
Corporate
Branding
Marketing
philanthropy
Shareholder
Profit

Scientific knowledge saves lives
But closed is costing us dearly …

…so closed access means people die…
…The software will demonstrate how we can search in future …
I’m from Congo where Ebola comes from. The Liberia outbreak was predicted 30 years ago in a paywalled paper

Semantic Fulltext
• EuropePMC: coherent Open Access
• getpapers: query, download (through API).
• AMI filters, checks[1], transforms facts in papers.
• sequences, species, genera, genes, dictionaries
[0] All operations shown run in total of <3 minutes.
[1] Dictionaries and lookup.
[2] Usable from home by anyone
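
To make "query, download (through API)" concrete, here is a minimal sketch of how a free-text query could be turned into a Europe PMC REST search URL. It is not getpapers' own code. The function name, the default limit and the choice to restrict results to open-access records are assumptions for illustration. The block only builds the URL and makes no network request.

```python
from urllib.parse import urlencode

# Public Europe PMC REST search endpoint.
EPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_epmc_search_url(query: str, limit: int = 100, open_access_only: bool = True) -> str:
    """Build a Europe PMC search URL for a query (hypothetical helper, not part of getpapers)."""
    if open_access_only:
        # Assumption: restrict to open-access records so full text can be reused.
        query = f"({query}) AND OPEN_ACCESS:y"
    params = {
        "query": query,
        "format": "json",
        # The service caps the page size, so larger limits would need paging.
        "pageSize": min(limit, 1000),
        "resultType": "lite",
    }
    return f"{EPMC_SEARCH}?{urlencode(params)}"


example_url = build_epmc_search_url("neuroimaging", limit=100)
print(example_url)
```

Fetching that URL with any HTTP client returns a JSON page of matching records. A real downloader would then follow each record to its full text.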
Zika endemic areas
Wikimedia CC-BY-SA

Open Components
• All the literature – free FULLTEXT everywhere
• Universal dictionary
• Open software – modular
• FRICTIONLESS – no gatekeepers
• CC BY, CC0, BSD/MIT/Apache/GNU,

PREPRINTS!!
Crossref
EuropePMC
Wikidata
getpapers
AMI
We can change all that!
We can do everything ourselves! Look … demo

https://www.wikidata.org/wiki/Wikidata:WikiFactMine
ContentMine thanks the WikimediaFoundation for support
15 million articles, over 200 dictionaries

All the world’s 5 million FAIR Open Scientific articles (* 0.1 MB = 0.5 TB), indexed by ContentMine. Disk: 30 GBP. Raspberry Pi 3: 50 GBP
CC BY, PeterMR
Disk
Raspberry PI
Power
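
The storage figure on this slide is simple arithmetic, sketched below. The download-time estimate is an extra assumption: it supposes that the rate of 50 papers per second quoted for getpapers could be sustained over the whole corpus, which the slides do not claim.

```python
ARTICLES = 5_000_000       # open articles, as stated on the slide
MB_PER_ARTICLE = 0.1       # average size per article, as stated on the slide
PAPERS_PER_SECOND = 50     # rate quoted for getpapers; sustained rate is an assumption

# Decimal units: 1 TB = 1,000,000 MB.
total_tb = ARTICLES * MB_PER_ARTICLE / 1_000_000

# Time to fetch everything at the quoted rate, in hours.
hours = ARTICLES / PAPERS_PER_SECOND / 3600

print(f"storage: {total_tb:.2f} TB")
print(f"download time at {PAPERS_PER_SECOND} papers/s: {hours:.1f} hours")
```

So the whole corpus fits on one inexpensive disk, and under the stated assumption it could be fetched in a little over a day.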

*** getpapers runs FAST! Downloads 50 papers/sec => 3000/min => 180,000/hour
*** AMI-search:
Dictionaries based on anything in Wikidata (50 million items!) or your own.
We show country, brainparts, funders, disease…
Looking for feedback, volunteers, examples
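
Dictionary-based search can be pictured as follows. This is a toy sketch, not AMI's implementation: the function name, the three-term "country" dictionary and the sample text are all made up for illustration, and real dictionaries would come from Wikidata or from the user.

```python
import re
from collections import Counter


def search_with_dictionary(text: str, terms: list[str]) -> Counter:
    """Count whole-word, case-insensitive occurrences of each dictionary term in a text."""
    counts = Counter()
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        hits = re.findall(pattern, text, flags=re.IGNORECASE)
        if hits:
            counts[term] = len(hits)
    return counts


# Hypothetical mini-dictionary and sample text.
country_terms = ["Zimbabwe", "Liberia", "Congo"]
sample_text = (
    "A PCR machine costs far more in Zimbabwe than in the UK. "
    "The Liberia outbreak was predicted decades earlier. "
    "Researchers in Zimbabwe face the same APCs."
)

counts = search_with_dictionary(sample_text, country_terms)
print(counts)
```

Running one such search per dictionary (country, disease, funders and so on) over every downloaded paper gives the per-paper facts that later steps can aggregate.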

OpenNotebookScience
Jean-Claude Bradley presented with BlueObelisk
by Egon Willighagen

DEMO!!
(a) What is “neuroimaging”?
getpapers -q "neuroimaging" -x -k 100 -o neuro
ami-search-cooccur neuro country disease funders
(b) What does the MRC unit do?
getpapers -q "MRC Cognition and Brain Sciences Unit" -x -k 2000 -o cbsu
ami-search-cooccur cbsu country brainparts braincognition funders animaltesting
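
The idea behind a co-occurrence search can be sketched in a few lines: for each paper, pair every term found from one dictionary with every term found from another, then count the pairs across the corpus. This is an illustration of the general technique only, not the code behind ami-search-cooccur. The paper identifiers and term hits below are invented toy data.

```python
from collections import Counter
from itertools import product


def cooccurrence(papers: dict[str, dict[str, set[str]]], dict_a: str, dict_b: str) -> Counter:
    """Count how many papers mention each (term from dict_a, term from dict_b) pair together."""
    pairs = Counter()
    for hits in papers.values():
        terms_a = sorted(hits.get(dict_a, ()))
        terms_b = sorted(hits.get(dict_b, ()))
        for a, b in product(terms_a, terms_b):
            pairs[(a, b)] += 1
    return pairs


# Invented toy data: per-paper dictionary hits, keyed by dictionary name.
toy_papers = {
    "paper-1": {"country": {"Liberia"}, "disease": {"Ebola"}},
    "paper-2": {"country": {"Liberia", "Congo"}, "disease": {"Ebola"}},
    "paper-3": {"country": {"Brazil"}, "disease": {"Zika"}},
}

pairs = cooccurrence(toy_papers, "country", "disease")
for (country, disease), n in pairs.most_common():
    print(f"{country} / {disease}: {n}")
```

The resulting counts form a country-by-disease matrix that could be plotted as a heat map or network.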

ECR communities we work with
• Open MOOC (Jon Tennant)
• Open Knowledge Maps (Peter Kraker)
• Unpaywall (Heather Piwowar)
• WorldBrain (Oli Sauter)
• And ContentMine Fellows
• Alexandra Bannach-Brown (Edinburgh, Bond)
(neuroscience and animal experiments)
• And …

AMI-Bio Proposal to Mozilla
We invite you to submit a full application for AMI-bio: Citizen search and use of the biomedical literature - Request ID number MF-1811-05957.
Please submit your application by 11/30/2018.

Guanyang Zhang
• Biology, Arizona
• “My ContentMine Fellowship project will focus on mining weevil-plant associations from literature records.”
• “Motivation. Comprising ~70,000 described and 220,000 estimated species, weevils (Curculionoidea) are one of the most diverse plant-feeding insect lineages and constitute nearly 5% of all known animals.”
• “Knowledge of host plant associations is critical for pest management, conservation, and comparative biological research. This knowledge is, however, scattered in 300 years of historical literature and difficult to access.”
• Weevil-plant association network graph made with Google Fusion Table. Each blue circle is a weevil tribe and each yellow circle a plant genus. The size of a circle represents the number of associations.

Neo Christopher Chung
• Warsaw, Computational Biology
• Wants to find out geographic and temporal differences in the use of genomic software tools

ContentMine Workshops on Mining
Chris Kittel, CM, at Mozfest 2015
Stefan Kasberger, CM

Julia Reda, Pirate MEP, running ContentMine
software to liberate science 2016-04-16

Lars Willighagen
• 15 years old, NL
• Wants: extract data about conifers (relations to chemicals, height etc.)
• Outcome: database with webpage containing conifer properties
• Table Facts Visualiser DEMO
• Card DEMO
• Word Cloud
• “I applied to this fellowship to learn new things and combine the ContentMine with two previous projects I never got to finish, and I got really excited by the idea and the ContentMine at large.”

bioRxiv in
Citizen Health Search (CHS)
A proposal to Wellcome Trust (Open Research in Health call) with ContentMine, Cochrane and UCL-EPPI (CCU)
CHS puts semantic search on the desktop of the searcher. We index all the visible medical literature, normalize, section and index against a bank of user-chosen dictionaries.
CHS takes input from EPMC, bioRxiv and emerging community sources such as Crossref and Unpaywall, and outputs to Zenodo, Wikidata and CM-Science Source.
Citizen Dashboard

Question/s
• “How can I help?”
– Create dictionaries
– Document your voyage
– Spread the word
– Advocate
– Meet at the pub for hacking?
– Code (especially downstream - visualisation)
Anyone seriously interested in automatic extraction of data from tables and plots?

http://www.budapestopenaccessinitiative.org/read
… an unprecedented public good. …
… completely free and unrestricted access to [peer-reviewed literature] by all scientists, scholars, teachers, students, and other curious minds. …
… Removing access barriers to this literature will accelerate research, enrich education, share the learning of the rich with the poor and the poor with the rich, make this literature as useful as it can be, and lay the foundation for uniting humanity in a common intellectual conversation and quest for knowledge.
(Budapest Open Access Initiative, 2002)
