The scientific and medical literature is a vast resource of knowledge, but it needs turning into semantic FAIR form. The ContentMine can do this, and we presented a rapid overview of the potential.
Scientific search for everyone
1. SES 2018, London, UK,
2018-09-03
Scientific Search for Everyone
Peter Murray-Rust
The ContentMine
A new knowledgebase beyond journals
Images from ContentMine CC BY and Wikimedia CC BY-SA
pm286@cam.ac.uk
peter@contentmine.org
Why? How? Who?
2. (2x digital music industry!)
ContentMine is OpenLocked Non-Profit http://contentmine.org
The Right to Read is the Right to Mine
3. • Preprints, Unpaywall, Wikimedia, Repos, ContentMine, … offer a new generation of semantic (FAIR) science.
• Closed access means people die.
• Journal “publishing” divides the world.
• Scholarship is for all Citizens.
• I present a new generation of Citizen-based search tools beyond Journals.
4. http://www.budapestopenaccessinitiative.org/read
… an unprecedented public good. …
… completely free and unrestricted access to [peer-reviewed literature] by all scientists, scholars, teachers, students, and other curious minds. …
… Removing access barriers to this literature will accelerate research, enrich education, share the learning of the rich with the poor and the poor with the rich, make this literature as useful as it can be, and lay the foundation for uniting humanity in a common intellectual conversation and quest for knowledge.
(Budapest Open Access Initiative, 2002)
5. APCs and Journals MUST GO!
arXiv, bioRxiv, chemRxiv: 10$
Commercial publisher: 1800$
[Diagram labels: Review, Production, Hosting, Corporate, Branding, Marketing, philanthropy, Shareholder, Profit]
6. Publisher A inviting PMR to be on EdBoard: …
no waiver for Global South or developing countries.
We encourage authors around the world to publish
high-quality papers … (1854 USD APC)
We devote some of fees to support the
academic development […] , e.g., making journal travel
award .. ; best
author/guest editor award for … special issue;
Why APCs must go
Publisher B answer to PMR on EdBoard: …
[We don’t review or validate data; we expect
reviewers to do that]
7. The Publisher-Academic complex[1]
[Diagram. Nodes: Publishers, Academia, Taxpayer, Student, Researcher, Infrastructure, “The scholarly poor”. Flow labels: Glory+?, $$, MS, review, in-kind.]
[1] The Military-Industrial-Academic complex (1961) (Dwight D Eisenhower, US President)
8.
9. http://www.nytimes.com/2015/04/08/opinion/yes-we-were-warned-about-ebola.html
We were stunned recently when we stumbled across an article by European
researchers in Annals of Virology [1982]: “The results seem to indicate that
Liberia has to be included in the Ebola virus endemic zone.” In the future,
the authors asserted, “medical personnel in Liberian health centers should be
aware of the possibility that they may come across active cases and thus be
prepared to avoid nosocomial epidemics,” referring to hospital-acquired
infection.
Adage in public health: “The road to inaction is paved with research
papers.”
Bernice Dahn (chief medical officer of Liberia’s Ministry of Health)
Vera Mussah (director of county health services)
Cameron Nutt (Ebola response adviser to Partners in Health)
A System Failure of Scholarly Publishing
10. Citizen Search (10 mins)
• Tropical disease
– ALL Open(“West Africa”)*(“Flavivirus”). EuropePMC
– Index with disease, insect, country, insecticide
• Ferromagnetism (arxiv)
– 50 recent PDFs
– Index magnetism, elements, crystallography
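A minimal sketch of how the tropical-disease search could be phrased as a Europe PMC request. It only builds the URL and fetches nothing. Reading the slide's `*` as AND is an assumption, as is the use of the `OPEN_ACCESS:y` filter and the `query`, `format` and `pageSize` parameters, which should be checked against the Europe PMC REST documentation. The function names are hypothetical.

```python
from urllib.parse import urlencode

# Europe PMC REST search endpoint (assumed here; verify before relying on it).
EPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_epmc_query(terms, open_access_only=True):
    """AND together quoted phrases, optionally restricting to open access."""
    clauses = ['"{}"'.format(term) for term in terms]
    if open_access_only:
        clauses.append("OPEN_ACCESS:y")
    return " AND ".join(clauses)


def build_epmc_url(terms, page_size=25):
    """Return a search URL as a string; no network request is made."""
    params = {
        "query": build_epmc_query(terms),
        "format": "json",
        "pageSize": page_size,
    }
    return EPMC_SEARCH + "?" + urlencode(params)


query = build_epmc_query(["West Africa", "Flavivirus"])
url = build_epmc_url(["West Africa", "Flavivirus"])
print(query)
print(url)
```

The resulting URL can be pasted into a browser; indexing the hits against dictionaries is a separate step.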
11. Also see http://freeourknowledge.org/
A recent initiative – the Fair Open Access Alliance – has developed a set of
progressive criteria for journals which focus on two main objectives: regaining
financial control of the academic publishing system, and supporting open access
principles. In summary, these are:
• The journal has a transparent ownership structure, and is controlled by and responsive to the scholarly community.
• Authors of articles in the journal retain copyright.
• All articles are published open access and an explicit open access licence is used.
• Submission and publication is not conditional in any way on the payment of a fee from the author or its employing institution, or on membership of an institution or society.
• Any fees paid on behalf of the journal to publishers are low, transparent, and in proportion to the work carried out.
See www.fairopenaccess.org
12. Beyond FAIR – FOAA and REACT
99% of journal articles are neither FAIR nor ethical.
Fair Open Access FOAA
PMR’s principles are similar
Readers
Equitable – unite the world
Affordable
Citizens
Transparent
13. cc by-nc-sa license LabHack and Alliance Earth
1 APC = 1900 USD
1 bioreactor = 25 USD
1 Raspberry Pi = 55 USD
1 submission to bioRxiv = Free (10 USD hidden)
“a PCR machine in the UK is around £6000 but in Zimbabwe about $33000 - try convincing someone to pay APCs when they have to try and save for that.”
CITIZENS!
Zimbabwe. LabHack team from
Harare Institute of Technology.
14. Guanyang Zhang
Biology, Arizona
“My ContentMine Fellowship project will focus on mining weevil-plant associations from literature records.”
“Motivation. Comprising ~70,000 described and 220,000 estimated species, weevils (Curculionoidea) are one of the most diverse plant-feeding insect lineages and constitute nearly 5% of all known animals.”
“Knowledge of host plant associations is critical for pest management, conservation, and comparative biological research. This knowledge is, however, scattered in 300 years of historical literature and difficult to access.”
Weevil-plant association network graph made with Google Fusion Table. Each blue circle is a weevil
tribe and yellow circle a plant genus. The size of a circle represents the number of associations.
15. Neo Christopher Chung
Warsaw, Computational Biology
Wants to find out geographic and temporal differences in the use of genomic software tools
17. Julia Reda, Pirate MEP, running ContentMine
software to liberate science 2016-04-16
18. Lars Willighagen
15 years old NL
Wants: extract data about conifers (relations to chemicals, height etc.)
Outcome: database with webpage containing conifer properties
Table Facts Visualiser DEMO
Card DEMO
Word Cloud
“I applied to this fellowship to learn new things and combine the ContentMine with two previous projects I never got to finish, and I got really excited by the idea and the ContentMine at large.”
19. ContentMine goal
• Read 10,000 – 100,000 papers every day
• Make them semantic (where possible)
• Index against WikiData
• Extract semantic objects (chemistry,
computations, multivariate tables)
• Publish to Zenodo
• Aggregate, Filter, Mix with other domains
• MAKE SCIENCE AVAILABLE
You can help – if you want
20. Semantic Fulltext
• EuropePMC coherent OpenAccess
• getpapers: query, download (through API).
• AMI filters, checks[1], transforms facts in papers.
• sequences, species, genera, genes,
dictionaries
[0] All operations shown run in total of <3 minutes.
[1] Dictionaries and lookup.
[2] Usable from home by anyone
Zika endemic areas
Wikimedia CC-BY-SA
21. Commonest species in 120 Zika papers
423 Ae./Aedes aegypti
333 Ae./Aedes albopictus
63 Ae. bromeliae
58 Ae. lilii
46 Ae. hensilli
42 Glossina pallidipes
40 Plasmodium vivax
35 Ae. luteocephalus
28 Ae. vittatus
25 Ae. furcifer
22 Plasmodium falciparum
21 Drosophila melanogaster
pre="fever (DHF), are caused by the world's most prevalent mosquito-borne virus. 37 DENV is carried by " exact="Aedes aegypti" post=" mosquito, which is strongly affected by ecological and human drivers, but also influenced by clima" name="binomial"/>
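The species counts and the pre/exact/post snippet above suggest a dictionary lookup that records each hit with its surrounding context. The following is a rough sketch of that idea, not AMI's actual implementation. The two-entry dictionary, the `find_mentions` function, the `canonical` key and the 40-character context width are all hypothetical choices made for illustration.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary: canonical binomial -> accepted surface forms.
SPECIES = {
    "Aedes aegypti": ["Aedes aegypti", "Ae. aegypti"],
    "Aedes albopictus": ["Aedes albopictus", "Ae. albopictus"],
}


def find_mentions(text, dictionary, width=40):
    """Yield one record per dictionary hit, with left and right context."""
    for canonical, forms in dictionary.items():
        pattern = "|".join(re.escape(form) for form in forms)
        for match in re.finditer(pattern, text):
            yield {
                "pre": text[max(0, match.start() - width):match.start()],
                "exact": match.group(0),
                "post": text[match.end():match.end() + width],
                "canonical": canonical,
            }


# Toy text for illustration only; not taken from the 120 Zika papers.
text = ("DENV is carried by Aedes aegypti mosquito. "
        "Ae. aegypti and Ae. albopictus were both collected.")

mentions = list(find_mentions(text, SPECIES))
counts = Counter(m["canonical"] for m in mentions)
for name, n in counts.most_common():
    print(n, name)
```

Mapping the abbreviated and full forms to one canonical name is what allows a merged count such as "Ae./Aedes aegypti".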
22. Download all Open Access “Zika” from
EuropePMC in 10 seconds
Aedes aegypti, Wikimedia CC-BY-SA
Note: movies of this and other slides can be seen at https://vimeo.com/154705161
23. 3011 virus
1939 Ae./Aedes
1212 dengue
901 mosquito/es
894 species
791 ZIKV
721 using
716 DENV
567 detection
513 aegypti
484 infection
442 RNA
428 protein
401 albopictus
360 viral
Commonest words in 120 Zika papers
Mosquito spp.
Wikimedia CC-BY-SA
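A word-frequency list like the one above can be approximated with a plain bag-of-words count. This is a minimal sketch under stated assumptions: words are lowercased, tokens shorter than three letters are dropped, and no stopword list or Ae./Aedes merging is applied. None of these choices is confirmed by the slide, and the `word_counts` helper is hypothetical.

```python
import re
from collections import Counter


def word_counts(documents, min_length=3):
    """Count alphabetic tokens across all documents, ignoring case."""
    counts = Counter()
    for doc in documents:
        words = re.findall(r"[A-Za-z]+", doc)
        counts.update(w.lower() for w in words if len(w) >= min_length)
    return counts


# Toy sentences for illustration only; not the 120 Zika papers.
docs = [
    "Zika virus and dengue virus are carried by Aedes mosquito species.",
    "Detection of virus RNA in Aedes aegypti.",
]

top = word_counts(docs).most_common(2)
for word, n in top:
    print(n, word)
```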
27. But we can now
turn PDFs into
Science
We can’t turn a hamburger into a cow
Pixel => Path => Shape => Char => Word => Para => Document => SCIENCE
28. AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home
Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:
AMI reads the complete diagram, recognizes the paths and generates the molecules. Then she creates a stop-frame animation showing how the 12 reactions lead into each other.
31. Modern Diagram Mining
4500 separate images
Phylogenetic tree
supertree
A machine-compiled microbial supertree from figure-mining thousands of papers, Ross Mounce, Peter Murray-Rust, Matthew A Wills, 2017
https://riojournal.com/article/13589/
33. @Senficon (Julia Reda): Text & Data mining in times of #copyright maximalism:
"Elsevier stopped me doing my research"
http://onsnetwork.org/chartgerink/2015/11/16/elsevier-stopped-me-doing-my-research/ … #opencon #TDM
Elsevier stopped me doing my research
Chris Hartgerink
34. I am a statistician interested in detecting potentially problematic research such as data fabrication,
which results in unreliable findings and can harm policy-making, confound funding decisions, and
hamper research progress.
To this end, I am content mining results reported in the psychology literature. Content mining the
literature is a valuable avenue of investigating research questions with innovative methods. For
example, our research group has written an automated program to mine research papers for errors in
the reported results and found that 1/8 papers (of 30,000) contains at least one result that could
directly influence the substantive conclusion [1].
In new research, I am trying to extract test results, figures, tables, and other information reported in
papers throughout the majority of the psychology literature. As such, I need the research papers
published in psychology that I can mine for these data. To this end, I started ‘bulk’ downloading research
papers from, for instance, Sciencedirect. I was doing this for scholarly purposes and took into account
potential server load by limiting the amount of papers I downloaded per minute to 9. I had no intention
to redistribute the downloaded materials, had legal access to them because my university pays a
subscription, and I only wanted to extract facts from these papers.
Full disclosure, I downloaded approximately 30GB of data from Sciencedirect in approximately 10 days.
This boils down to a server load of 0.0021GB/[min], 0.125GB/h, 3GB/day.
Approximately two weeks after I started downloading psychology research papers, Elsevier notified my
university that this was a violation of the access contract, that this could be considered stealing of
content, and that they wanted it to stop. My librarian explicitly instructed me to stop downloading
(which I did immediately), otherwise Elsevier would cut all access to Sciencedirect for my university.
I am now not able to mine a substantial part of the literature, and because of this Elsevier is directly
hampering me in my research.
[1] Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2015). The
prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 1–22.
doi: 10.3758/s13428-015-0664-2
Chris Hartgerink’s blog post
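The load figures quoted in the post follow from simple division. A minimal check, using only the 30 GB and 10 days stated there (both given as approximate) and assuming downloads were spread evenly over the whole period:

```python
total_gb = 30   # approximate volume downloaded, as stated in the post
days = 10       # approximate duration, as stated in the post

per_day = total_gb / days      # GB per day
per_hour = per_day / 24        # GB per hour
per_minute = per_hour / 60     # GB per minute

print(f"{per_day:.0f} GB/day, {per_hour:.3f} GB/h, {per_minute:.4f} GB/min")
```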
35. All the world’s 5 million FAIR Open Scientific articles (* 0.1 MB = 0.5 TB), indexed by ContentMine. Disk 30 GBP. Raspberry Pi 3 50 GBP.
CC BY, PeterMR
Disk
Raspberry PI
Power
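The storage estimate on this slide can be reproduced in a few lines. The sketch below takes the slide's 5 million articles and 0.1 MB average size at face value, and assumes decimal units (1 TB = 1,000,000,000 KB). Power supply cost is not given on the slide and is left out.

```python
articles = 5_000_000
kb_per_article = 100            # 0.1 MB per article, as on the slide

total_kb = articles * kb_per_article
total_tb = total_kb / 1_000_000_000   # decimal terabytes (assumption)

disk_gbp = 30
raspberry_pi_gbp = 50
hardware_gbp = disk_gbp + raspberry_pi_gbp

print(f"{total_tb} TB of articles on {hardware_gbp} GBP of hardware")
```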
36. bioRxiv in
Citizen Health Search (CHS)
A proposal to Wellcome Trust (Open Research in Health call) with ContentMine, Cochrane and UCL-EPPI (CCU)
CHS puts semantic search on the desktop of the searcher. We index all the visible medical literature, normalize and section it, and index it against a bank of user-chosen dictionaries.
CHS takes input from EPMC, bioRxiv and emerging community sources such as Crossref and Unpaywall, and outputs to Zenodo, Wikidata and CM-Science Source.
Citizen Dashboard
37. Gene-species co-occurrence in Marchantia from bR.
PDF
Semantic Sectioned
Indexed HTML
Dashboard
West Africa
Flavivirus
600 hits
download
PDF->SVG->HTML
Dictionary etc. search
Species-gene co-occurrence
Automatic!
Synoptic view: gene, country, organiz, plantparts, species
Author, abstract, kword, abbrev, intro
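Species-gene co-occurrence can be sketched as counting, per document, every pair of a dictionary species and a dictionary gene found in the same text. This is an illustrative sketch only, not the dashboard's method. The gene names `geneA` and `geneB`, the toy sentences and the naive substring matching are placeholders, not results.

```python
from collections import Counter
from itertools import product


def cooccurrence(docs, species, genes):
    """Count documents in which a species and a gene both appear."""
    counts = Counter()
    for text in docs:
        found_species = [s for s in species if s in text]
        found_genes = [g for g in genes if g in text]
        counts.update(product(found_species, found_genes))
    return counts


# Placeholder dictionaries and sentences, for illustration only.
species = ["Marchantia polymorpha", "Arabidopsis thaliana"]
genes = ["geneA", "geneB"]
docs = [
    "geneA and geneB were studied in Marchantia polymorpha.",
    "geneA was compared between Marchantia polymorpha and Arabidopsis thaliana.",
    "Arabidopsis thaliana appears here without any dictionary term.",
]

counts = cooccurrence(docs, species, genes)
for (sp, gene), n in counts.most_common():
    print(n, sp, gene)
```

The same pair counting works for any two dictionaries, for example country against species.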
38. Results of searching for “ferromagnetism” on arxiv 201806-201808
And > 100 more
arxiv compchem country crystal element orgs magnetism
Bag of words