The Content Mine
Peter Murray-Rust
University of Cambridge and Open Knowledge Foundation
A community of people and machines to extract 100,000,000 scientific facts from the scholarly literature

Slides: CC-BY

Images © Wikimedia CC-BY-SA
If you’re bored …
important
THREE most important Open Access
publishers?
(besides BMC and PLoS)
THREE most important Open Access
repositories?
TDM* (“Text-and-Data-mining”) is the use of machines to read and understand massive numbers of documents
“The Right to Read is the Right to Mine”. PMR + OKFN
“Closed Access means people die” (PM-R)
“Text and Data Mining saves Lives” (John McNaught)
*PMR uses “content mining”
Who’s this?

(Credit: Seth Rosenblatt/CNET)
Aaron Swartz
Died 2013-01-11

Facing 30 years in jail for
downloading JSTOR

http://news.cnet.com/8301-13578_357611642-38/call-to-action-kicks-off-secondaaron-swartz-hackathon/
(Credit: Seth Rosenblatt/CNET)
Typical papers destroy data
Numeric: astro1307.5851v4.pdf
Diagram: birds1471-2148-11-313.pdf
RCUK, Wellcome, ERC, NSF … require fully OPEN

According to EU Vice-President Neelie Kroes, speaking at the Research Data Alliance, we are entering a new “era of open science”, which will be “good for citizens, good for scientists and good for society”.
She explicitly highlighted the transformative potential of open access, open data, open software and open educational resources, mentioning the EU’s policy requiring open access to all publications and data resulting from EU-funded research.
http://blog.okfn.org/2013/03/21/we-are-entering-an-era-of-open-science-says-eu-vp-neeliekroes/#sthash.3SWDXDE6.dpuf
Content Mining
• Make science discoverable
• Extract facts for research
• Build reusable objects
• Aggregate
• Create new businesses
• Check for errors => better science
Content Mining Problems
• Secondary publishers create walled gardens
• Publishers’ contracts ban content-mining
• Publishers cut off Universities who mine
• Publishers lobby governments to require “licences for content mining”
• UK Hargreaves legislation will override this by law. Starts 2014.
http://blogs.ch.cam.ac.uk/pmr/2013/10/02/text-and-data-mining-fighting-for-ourdigital-future-peter-murray-rust-is-the-problem/
Walled Gardens (“Free” but not “Open”)
A service provider has control over applications, content, and media and restricts convenient access to non-approved applications or content.
Examples: Mendeley, Facebook, Cambridge Crystallographic Data Centre, OCLC
#animalgarden “Walled Gardens” https://vimeo.com/34323486
http://www.theguardian.com/science/2012/may/23/text-mining-research-tool-forbidden
Licences destroy Content Mining

STM Publishers Licence

WE WALKED OUT
• British Library
• JISC
• RLUK
• OKFN
• …
• Ross Mounce
• PM-R

2012_03_15_Sample_Licence_Text_Data_Mining.pdf
(Summary: PMR has NO rights)
• [cannot publish to:] “libraries, repositories, or archives”
• [cannot] “Make the results of any TDM Output available on an externally facing server or website”
• “Subscriber shall pay a […] fee”

Heather Piwowar: “negotiating with publishers [made me physically ill]”
Licensing TDM is like publishers taxing spectacles
We can’t turn a hamburger into a cow

But we can now turn PDFs into Science
[Figure: zoomed-in plot from the PDF, annotated with what can be extracted: TITLES, DATA (2000+ points), UNITS, TICKS, SCALE, QUANTITY]
Dumb PDF → Automatic extraction → CSV → Gaussian Filter → 2nd Derivative → Semantic Spectrum
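The CSV, Gaussian filter and 2nd-derivative stages named on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not AMI's actual code: the function names (`read_xy_csv`, `smooth`, `second_derivative`, `find_peaks`), the two-column CSV layout with a header row, and the default `sigma` are all hypothetical.

```python
import csv
import io
import math


def read_xy_csv(text):
    """Parse two-column CSV text (a header row is assumed) into x and y lists."""
    rows = list(csv.reader(io.StringIO(text)))
    xs = [float(r[0]) for r in rows[1:] if r]
    ys = [float(r[1]) for r in rows[1:] if r]
    return xs, ys


def gaussian_kernel(sigma, radius=None):
    """Return a normalised 1-D Gaussian kernel as a list of weights."""
    if radius is None:
        radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(values, sigma=2.0):
    """Convolve values with a Gaussian kernel, clamping indices at the edges."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * values[j]
        out.append(acc)
    return out


def second_derivative(values, step=1.0):
    """Central-difference second derivative; the two end points are set to 0.0."""
    n = len(values)
    d2 = [0.0] * n
    for i in range(1, n - 1):
        d2[i] = (values[i - 1] - 2.0 * values[i] + values[i + 1]) / (step * step)
    return d2


def find_peaks(values, sigma=2.0):
    """Indices where the smoothed curve is a local maximum with negative curvature."""
    s = smooth(values, sigma)
    d2 = second_derivative(s)
    return [i for i in range(1, len(s) - 1)
            if s[i] > s[i - 1] and s[i] >= s[i + 1] and d2[i] < 0]
```

Smoothing before differentiating matters because a second derivative amplifies the pixel-level noise that comes with points recovered from a PDF plot.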
PDF → AMI → HTML
Evolution of ultraviolet vision in the largest avian radiation - the passerines
Anders Ödeen 1*, Olle Håstad 2,3 and Per Alström 4

Styles, superscripts and diåcritics preserved!
PDF →
Turdus iliacus
Taeniopygia guttata
Serinus canaria
Lanius excubitor
Melopsittacus undulatus
Pavo cristatus
Sturnus vulgaris
Dolichonyx oryzivorus
Ficedula hypoleuca
Vaccinium myrtillus
Falco tinnunculus

Turdus
Pomatostomus
Leothrix
Amytornis
Acanthisitta
Orthonyx x 2
Malurus
Cnemophilus x 4
Philesturnus x 2
Motacilla x 2
Toxorhampus x 2
Typical phylo tree: 60 nodes, complex and minuscule annotation,
vertical text, hyphenation and valuable branch lengths. AMI extracts ALL
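Once the tree labels are available as text, species names like those listed above can be picked out with a simple pattern. The sketch below is an assumption about how one might do this, not a description of AMI: `candidate_binomials` and the `known_genera` set are hypothetical, and a real system would check candidates against a full taxonomic name list.

```python
import re

# "Genus species": a capitalised word followed by a lowercase word of 3+ letters.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+)\s+([a-z]{3,})\b")


def candidate_binomials(text, known_genera):
    """Return 'Genus species' strings whose genus appears in known_genera."""
    found = []
    for genus, species in BINOMIAL.findall(text):
        # The pattern alone also matches ordinary phrases such as "The tree",
        # so a genus whitelist (assumed here) is used to filter candidates.
        if genus in known_genera:
            found.append(f"{genus} {species}")
    return found


genera = {"Turdus", "Sturnus", "Pavo", "Falco"}  # illustrative subset only
text = "Sturnus vulgaris and Pavo cristatus were sampled. The tree is large."
names = candidate_binomials(text, genera)
```

The whitelist is the important design choice: the pattern is deliberately loose, and precision comes from the name list rather than from the regular expression.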
AMI
Posterior probability: 0.84, 0.91, 0.93, 0.95

23.12
34.54
37.21
38.55

NexML
HTML

AMI can MEASURE
Branch lengths!

Genus: Acanthisitta, Acrocephalus, Ailuroedus, Ailuroedus, Amytornis, Camptostoma
Family: Acanthisittidae, Acanthizidae, Acrocephalidae, Callaeidae, Campephagidae, Cnemophilidae, Corvidae
10 million spectra published per year
Review of the NMR data reported in the Supporting
Information in this article evidences instances where some of
the spectra were inappropriately edited to remove
impurities. A coauthor and former student, Dr. Bruno
Anxionnat, has shared with me formal communication in
which he states “I would like to take full responsibility for this
entire situation. I was in charge of making the SI of my papers
and I erased some peaks without telling anybody. All my
supervisors (Pr. Cossy, Dr. Gomez Pardo and Dr. Ricci) trusted
me and I wasn't dependable. I am the only one who has to be
blamed for all that, in any case them. I know my behavior is
highly unethical. I am deeply sorry for what I have done and
for hurting people….”
Crystallography Walled Garden
A service provider has control over applications, content, and media and restricts convenient access to non-approved applications or content.
From Saulius Grazulis
Crystaleye
• A database of 200,000 crystal structures scraped from the CIF supplemental information of publications
• CML molecules and name-value pairs
• Re-usable as fragment base
Nick Day, Jim Downing, Sam Adams, N. W. England and Peter Murray-Rust*
J. Appl. Cryst. (2012). 45, 316–323, doi:10.1107/S0021889812006462
http://wwmm.ch.cam.ac.uk/crystaleye
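To make "name-value pairs" concrete, here is a minimal sketch of reading simple data items from a CIF file. It is not the Crystaleye code: `parse_cif_items` and `numeric_value` are hypothetical names, and the parser deliberately handles only single-line `_name value` items, skipping `loop_` tables and multi-line text fields.

```python
import re

# A data item on one line: "_name   value"
ITEM = re.compile(r"^(_\S+)\s+(\S.*)$")
# A number with an optional standard uncertainty in brackets, e.g. 10.234(2)
NUMBER_WITH_SU = re.compile(r"^(-?\d+(?:\.\d+)?)(?:\(\d+\))?$")


def parse_cif_items(text):
    """Collect single-line CIF data items into a dict of name -> value string."""
    items = {}
    for line in text.splitlines():
        match = ITEM.match(line.strip())
        if match:
            name, value = match.groups()
            items[name] = value.strip().strip("'\"")
    return items


def numeric_value(value):
    """Return the float part of a CIF number, or None if it is not numeric."""
    match = NUMBER_WITH_SU.match(value)
    return float(match.group(1)) if match else None


sample = """data_example
_cell_length_a    10.234(2)
_cell_length_b    5.1
_symmetry_space_group_name_H-M 'P 21/c'
loop_
_atom_site_label
_atom_site_fract_x
C1 0.1234(5)
"""
pairs = parse_cif_items(sample)
```

Loop headers carry a name but no value on the same line, so the pattern ignores them without any special-case code.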
“nuggets” in a scientific paper
places

project
Value ranges

quantity
units
chemical
Humans aren’t designed to mine this …
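Machines, by contrast, can pick out some of these nuggets mechanically. The sketch below finds quantities, value ranges and units with a regular expression; it is an illustration only, and the `UNITS` list, the `QUANTITY` pattern and `extract_quantities` are assumptions rather than part of AMI.

```python
import re

# Illustrative unit list (an assumption); longer units are tried first so that
# "mg" is not read as "g" and "min" is not cut short.
UNITS = ["°C", "K", "mg", "g", "mL", "min", "h"]
_unit_alt = "|".join(re.escape(u) for u in sorted(UNITS, key=len, reverse=True))

QUANTITY = re.compile(
    r"(?<![\w.])(?P<low>\d+(?:\.\d+)?)"          # a number, e.g. 120 or 5.5
    r"(?:\s*[-–]\s*(?P<high>\d+(?:\.\d+)?))?"    # optional range end, e.g. -130
    r"\s*(?P<unit>" + _unit_alt + r")(?![A-Za-z])"
)


def extract_quantities(text):
    """Return (low, high, unit) tuples; high equals low for single values."""
    results = []
    for m in QUANTITY.finditer(text):
        low = float(m.group("low"))
        high = float(m.group("high")) if m.group("high") else low
        results.append((low, high, m.group("unit")))
    return results


sentence = ("The mixture was heated at 120–130 °C for 2 h, "
            "then 5.5 mg of product was isolated.")
nuggets = extract_quantities(sentence)
```

Returning every value as a (low, high) pair keeps single values and ranges in one shape, which makes later aggregation simpler.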
The Content Mine
A community of people and machines to
extract scientific facts from the scholarly
literature on a global scale.

https://vimeo.com/78353557
AMI

100,000 lines of Open code for translating PDFs to science.
10 years’ work (PMR).
AMI works!
We have friends

• ProPublica is a NY digital-democracy newspaper
• Tabula is an Open PDF-table extractor
• Mozilla fights for web freedom
Boot-Camps and hacks

Open Science, Oxford 2013-11-27
(sold out before announcement!)
Collaborators:
I have talked with:
• BMC
• PLoS
• British Library
• Mozilla
• Software Carpentry
• EuropePMC
• Creative Commons
• OKFN
I hope to talk with:
• Wellcome
• JISC
• Ubiquity
• Royal Society
• Kitware
• SPARC
• …
• …
• “The right to read is the right to mine”
• Unrestricted TDM saves lives
• Libraries – reject TDM restrictions
• Publishers – Damascene conversion
• Funders – insist on CC-BY

@petermurrayrust
http://blogs.ch.cam.ac.uk/pmr
3 most important Open Access repositories?
• Wikimedia
• Github, StackOverflow.
• National libraries and museums.

3 most important Open Access publishers?
• Wikipedia
• NIH+EBI+OtherBioDatabases
• arXiv, CERN/SCOAP
• +PLoS+BMC
300 Billion USD annually on Science+Medicine
FACTS! LOST! FACTS! LOST! FACTS! LOST! FACTS!
• “we repeat about 25% of our chemistry because we didn’t know we’d done it already”
• 10,000 phylogenetic trees at 25,000 USD each; only 4% have data (loss = 240 Million USD)
• Computational chemistry – materials NO DATA, perhaps 1,000,000,000 USD
FACTS! LOST! FACTS! LOST! FACTS! LOST! FACTS!
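The 240 Million USD figure follows from the slide's own numbers: 10,000 trees at 25,000 USD each is 250 Million USD, and if only 4% have data then 96% of that value is lost. A small check, where `lost_value` is a hypothetical helper and integer percentages are used to avoid rounding noise:

```python
def lost_value(n_items, cost_each, percent_with_data):
    """Total spend on items whose data was not published (integer arithmetic)."""
    total = n_items * cost_each
    return total * (100 - percent_with_data) // 100


# 10,000 trees x 25,000 USD = 250,000,000 USD; 96% of that is without data.
phylo_loss = lost_value(10_000, 25_000, 4)
```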

The Content Mine (presented at UKSG)
