
ContentMine: Liberating scholarship from Open publications and theses

Theses represent a huge amount of untapped value. We show how contentmine.org technology can be used to mine them and extract knowledge

Published in: Education

  1. 1. Scholarly Infrastructure: Open or Closed? Peter Murray-Rust*, University of Cambridge and OpenKnowledge DRTD-SHS, Lille, FR 2015-04-21 We can build an Open discovery and re-use system. Theses represent huge untapped communal knowledge. Bliss was it in that dawn to be alive, But to be young was very heaven! Wordsworth on the French Revolution
  2. 2. Scholarly infrastructure becomes closed No accountability for monitoring and control
  3. 3. The Digital Enlightenment: some of my icons Diderot, Paris, 1751 Berkeley, US, 1966 Paris, 1968 UK, 1969-73
  4. 4. “How We Stopped SOPA”: This bill ... shut down whole websites. Essentially, it stopped Americans from communicating entirely with certain groups.... I called all my friends, and we stayed up all night setting up a website for this new group, Demand Progress, with an online petition opposing this noxious bill.... We [got] ... 300,000 signers.... We met with the staff of members of Congress and pleaded with them.... And then it passed unanimously.... And then, suddenly, the process stopped. Senator Ron Wyden ... put a hold on the bill. He added, “We won this fight because everyone made themselves the hero of their own story. Everyone took it as their job to save this crucial freedom.” Robert Swartz: “Aaron was killed by the government, and MIT betrayed all of its basic principles.” Aaron Swartz
  5. 5. Some Children of the Digital Enlightenment • David Carroll & Joe McArthur: OAButton • Rayna Stamboliyska & Pierre-Carl Langlais • Jon Tennant • Ross Mounce • Jenny Molloy • Erin McKiernan • Jack Andraka • Michelle Brook • Heather Piwowar • TheContentMine Team • Rufus Pollock • Jonathan Gray • Sophie Kay Jean-Claude Bradley [1], a chemist, developed Open Notebook Science: making the entire primary record of a research project publicly available online as it is recorded (WP). J-C promoted these ideas with UNDERGRADUATE scientists. [1] Unfortunately J-C died in 2014; we held a memorial meeting in Cambridge. Sophie Kay
  6. 6. http://www.nytimes.com/2015/04/08/opinion/yes-we-were-warned-about-ebola.html We were stunned recently when we stumbled across an article by European researchers in Annals of Virology [1982]: “The results seem to indicate that Liberia has to be included in the Ebola virus endemic zone.” In the future, the authors asserted, “medical personnel in Liberian health centers should be aware of the possibility that they may come across active cases and thus be prepared to avoid nosocomial epidemics,” referring to hospital-acquired infection. Adage in public health: “The road to inaction is paved with research papers.” Bernice Dahn (chief medical officer of Liberia’s Ministry of Health) Vera Mussah (director of county health services) Cameron Nutt (Ebola response adviser to Partners in Health) A System Failure of Scholarly Publishing
  7. 7. Open Scholarship must build its own discovery system before it is too late Communities of Practice + software: • Wikip(m)edia • Open Street Map • Open Corporates Theses are under OUR control and hugely valuable.
  8. 8. eTheses • Citizens pay $20,000,000,000*… • … for research in 200,000 science theses*… • … cost $100,000 each to create* … • … re-use ??? (near zero) • … Value??? • *Please challenge these numbers… • NOTE: we pay publishers $15,000,000,000 for journals and APCs
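The arithmetic behind the eTheses slide can be checked in a few lines. This is only a sketch using the slide's own figures, which the author explicitly invites readers to challenge; the variable names are illustrative.

```python
# Figures as quoted on the slide (the author asks for them to be challenged).
theses = 200_000                  # science theses
cost_per_thesis = 100_000         # USD to create each thesis
journal_spend = 15_000_000_000    # USD paid to publishers for journals and APCs

# Total public cost of the theses implied by the first two figures.
total_cost = theses * cost_per_thesis
print(f"Total cost of theses: ${total_cost:,}")

# How the thesis spend compares with the journal spend.
ratio = total_cost / journal_spend
print(f"Thesis spend is {ratio:.2f}x the journal spend")
```

On these figures the thesis total matches the $20,000,000,000 on the slide and is about a third larger than the journal spend.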
  9. 9. Linked Open Data – the world’s knowledge very little physical science and THESES??  http://upload.wikimedia.org/wikipedia/commons/3/34/LOD_Cloud_Diagram_as_of_September_2011.png DBPedia BIO Comp Lib PDB Ontologies GOV GOV.uk Music, Art Literature Social Knowledge bases RDF triples
  10. 10. Liberation Software Steve Coast developed OpenStreetMap to challenge the monopoly of the UK Ordnance Survey
  11. 11. The Right to Read is the Right to Mine http://contentmine.org
  12. 12. OUR TEAM Jenny Molloy @jenny_molloy Ross Mounce @rmounce Richard Smith-Unna @blahah404 Stephanie Smith-Unna @treblesteph Mark MacGillivray @cottagelabs Peter Murray-Rust @petermurrayrust Charles Oppenheim @CharlesOppenh Graham Steel @McDawg
  13. 13. https://en.wikipedia.org/wiki/Irrigation#mediaviewer/File:Pump-enabled_Riverside_Irrigation_in_Comilla,_Bangladesh,_25_April_2014.jpg CC BY-SA 3.0 Daily Stream of 100,000 Open Facts Twitter? Indexed by CAT http://catalogue.cottagelabs.com/browse http://catalogue.cottagelabs.com/graph
  14. 14. Content-Mining (TDM*) • Now COMPLETELY LEGAL IN UK since 2014-06-01 (“Hargreaves”)… • … Whatever the publishers tell you. Do NOT sign their APIs • UK can legally IGNORE contractual restrictions • Movement to extend this to Europe (Julia Reda, MEP proposal) • And STM publishers are spending millions to stop us *Text and Data Mining
  15. 15. What is “Content”? http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY SECTIONS MAPS TABLES CHEMISTRY TEXT MATH contentmine.org tackles these
  16. 16. “nuggets” in a scientific paper quantity units Value ranges Humans aren’t designed to mine this …  chemical project places
  17. 17. What is “Content”? Emily Sena (neuroscience.ed.ac.uk) spends half a day digitising a diagram like this ContentMine will soon be able to do it in 1 second
  18. 18. • CRAWL the web for scientific documents (articles, grey literature, repositories) • quickSCRAPE pages (text, graphics, images, data) • NORMA-lize page to semantic form …Open semantic science … • MINE pages with your methods and tools (AMI) • CAT-alogue results in searchable index • Automate daily process (CANARY) contentmine.org Infrastructure
  19. 19. quickscrape Crawl Feed Norma Index & Transform PDF XML URL DOI Scientific literature Repositories DOC CSV sHTML Plugins Regex Sequences Species Bespoke Scrapers XPath Per-Journal Taggers Per-Journal Metadata Chemistry Phylogenetics Farming AMI BadHTML OCR Diagrams Open NORMA-lized Scientific Literature + Facts CANARY pipeline CAT-alogue index
  20. 20. CORE Repository UK
  21. 21. HAL repository FR
  22. 22. Retrieval/Extraction Technologies • Bag Of Words (https://en.wikipedia.org/wiki/Bag-of-words_model) • Term-Frequency Inverse-Document-Frequency (https://en.wikipedia.org/wiki/Tf%E2%80%93idf) • Regular Expressions • Templates (Information Extraction) • Natural Language Processing (NLP) • Image processing and mining • Lookup (Wikidata, Bioscience databases)
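The first two techniques in this list are simple enough to sketch. The example below is a minimal, standard-library illustration of a bag of words and one common TF-IDF variant; it is not ContentMine's implementation, and the function names and the three toy "documents" are assumptions made for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count each run of letters as one word."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def tf_idf(docs):
    """Return one {word: score} dict per document.

    tf  = occurrences of the word in the document / words in the document
    idf = log(number of documents / number of documents containing the word)
    """
    bags = [bag_of_words(d) for d in docs]
    n_docs = len(bags)

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())

    scores = []
    for bag in bags:
        total = sum(bag.values())
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in bag.items()
        })
    return scores


# Toy corpus (hypothetical text, standing in for thesis abstracts).
docs = [
    "the synthesis of the compound",
    "the phylogenetic tree of the passerines",
    "the clinical trial of the compound",
]
scores = tf_idf(docs)
top = max(scores[1], key=scores[1].get)
print("Highest-scoring word in document 2:", top)
```

Words that occur in every document, such as "the", get a score of zero, while words that occur in only one document score highest. That is why TF-IDF tends to surface the terms that distinguish one thesis from the rest.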
  23. 23. Bag of Words Theses from HAL repository
  24. 24. Species
  25. 25. Regex for Clinical Trials
  26. 26. CLINICAL TRIALS How do we find (mentions of) clinical trials? Is a document a (clinical) trial? What is the subject of the trial? What is the methodology used? How many/long? Does the design and practice conform to CONSORT? What are the outcomes? Can we extract specific re-usable information? Who is involved? (researchers, sponsors, patients?) Has a proposed trial been completed and reported?
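The first of these questions, finding mentions of clinical trials, is the one a regular expression can answer most directly. The sketch below is illustrative and not the regex set shown in the talk. It assumes two registry identifier shapes, "NCT" plus eight digits (ClinicalTrials.gov) and "ISRCTN" plus eight digits (ISRCTN registry), and adds a small phrase pattern. The function name and the sample sentence, including its identifiers, are made up for the example.

```python
import re

# Assumed identifier shapes (not taken from the slides):
#   NCT + 8 digits     ClinicalTrials.gov
#   ISRCTN + 8 digits  ISRCTN registry
TRIAL_ID = re.compile(r"\b(NCT\d{8}|ISRCTN\d{8})\b")

# Phrases that suggest a document is about a trial (illustrative, not exhaustive).
TRIAL_PHRASE = re.compile(
    r"\b(randomi[sz]ed controlled trial|clinical trial)s?\b",
    re.IGNORECASE,
)


def find_trial_mentions(text):
    """Return registry identifiers and trial phrases found in the text."""
    return {
        "ids": TRIAL_ID.findall(text),
        "phrases": [m.group(0) for m in TRIAL_PHRASE.finditer(text)],
    }


# Made-up sentence with made-up identifiers.
sample = (
    "This randomised controlled trial (NCT01234567) followed an earlier "
    "clinical trial registered as ISRCTN12345678."
)
result = find_trial_mentions(sample)
print(result["ids"])
print(result["phrases"])
```

Patterns like these only find mentions. The later questions on the slide (methodology, CONSORT conformance, outcomes) need templates or NLP rather than a single regex.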
  27. 27. How a machine reads a chemical thesis nodes are compounds; arrows are reactions
  28. 28. Natural Language Processing Part of speech tagging (Wordnet, Brown Corpus, etc.)
  29. 29. Parsing chemical sentences
  30. 30. http://chemicaltagger.ch.cam.ac.uk/ • Typical chemical synthesis
  31. 31. Automatic semantic markup of chemistry Could be used for analytical, crystallization, etc.
  32. 32. Open Content Mining of FACTs Machines can interpret chemical reactions We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B Eur.
  33. 33. AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY: AMI reads the complete diagram, recognizes the paths and generates the molecules. Then she creates a stop-frame animation showing how the 12 reactions lead into each other. CLICK HERE FOR ANIMATION (may be browser dependent)
  34. 34. Evolution of ultraviolet vision in the largest avian radiation - the passerines Anders Ödeen 1*, Olle Håstad 2,3 and Per Alström 4 PDF  HTML  Styles, superscripts and diåcritics preserved! AMI
  35. 35. PDF  Turdus iliacus Taeniopygia guttata Serinus canaria Lanius excubitor Melopsittacus undulatus Pavo cristatus Sturnus vulgaris Dolichonyx oryzivorus Ficedula hypoleuca Vaccinium myrtillus Falco tinnunculus Turdus Pomatostomus Leothrix Amytornis Acanthisitta Orthonyx x 2 Malurus Cnemophilus x 4 Philesturnus x 2 Motacilla x 2 Toxorhampus x 2
  36. 36. Typical phylo tree: 60 nodes, complex and minuscule annotation, vertical text, hyphenation and valuable branch lengths. AMI extracts ALL
  37. 37. Acanthisittidae Acanthizidae Acrocephalidae Callaeidae Campephagidae Cnemophilidae Corvidae 0.84 0.91 0.93 0.95 Acanthisitta Acrocephalus Ailuroedus Ailuroedus Amytornis Camptostoma AMI 23.12 34.54 37.21 38.55 Posterior probability AMI can MEASURE Branch lengths! NexML Genus Family HTML
  38. 38. https://blogs.ch.cam.ac.uk/pmr/2014/06/25/content-mining-we-can-now-mine-images-of-phylogenetic-trees-and-more/ for story of extraction Thinning Topology Serialization Newick
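The last step named on this slide, serialization to Newick, can be sketched once a tree topology is in memory. The example is a minimal illustration, not AMI's code: the tuple layout, the function name, the topology and the branch lengths are all assumptions, with only the genus names borrowed from the earlier slides.

```python
def to_newick(node):
    """Serialize a (name, branch_length, children) tuple to Newick text.

    Leaves have an empty children list; internal nodes may have an empty
    name; the root usually has no branch length (None).
    """
    name, length, children = node
    label = name
    if children:
        inner = ",".join(to_newick(child) for child in children)
        label = f"({inner}){name}"
    if length is not None:
        label += f":{length}"
    return label


# Hypothetical three-leaf tree; the branch lengths are illustrative only.
tree = ("", None, [
    ("Acanthisitta", 0.10, []),
    ("", 0.05, [
        ("Amytornis", 0.20, []),
        ("Malurus", 0.30, []),
    ]),
])

newick = to_newick(tree) + ";"
print(newick)
```

The output is a single line of nested parentheses with `name:length` pairs, which is the form most phylogenetics tools accept as input.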
  39. 39. Problems • Cannot do handwriting • Scanned documents give poorer results • The older the document the poorer the result • Tables are a major problem • Always try to get the original document • XML > Word > PDF • Vector images >> PNG > JPEG • Maths, chemistry are specialist
  40. 40. Additional material on Open Notebook Science (not presented)
  41. 41. Free/Open Software Development Engineered repository World community CODE rewrite validate CODE fork CODE Re-use CODE Re-use Github, BitBucket StackOverflow, Apache inspires OSI Example: ContentMine at http://github.com/ContentMine/quickscrape
  42. 42. Sophie Kershaw, Panton Fellow, Training PhD Students
  43. 43. “Do you think you would be more confident in the future about trying to apply Open techniques to your work..?” • 50% Yes, by myself • 41% Yes, with help/guidance • 9% No opinion/neutral • 0% No
  44. 44. Rotation-Based Learning (RBL) Phase 1: Initiator • No communication permitted between groups • Attempt to reproduce existing literature • Deliver a coherent research story by the end of Phase 1 Phase 2: Successor • Communication between groups still prohibited • Validate and develop the inherited research story • Critique your predecessors • Role of research producer vs. research user • Can this approach help to foster awareness of reproducibility issues? Throughout Phases 1 & 2: • Daily lectures on open science culture & techniques • First-hand application to own research work • Version control using GitHub • Daily group supervision
  45. 45. http://michaelnielsen.org/blog/reinventing-discovery/ http://en.wikipedia.org/wiki/Reinventing_Discovery
  46. 46. TOOLS Open Notebook Science Open engineered repository World community INSTRUMENT validate merge MODEL CODE DATA DATA knowledge calibrate Problems are solved communally; Nothing is needlessly duplicated; “publication“ is continuous Machines and humans Working together CC-BY
  47. 47. “Free” and “Open” • “Free software is a matter of liberty, not price. ‘free speech’, not ‘free beer’.” (R M Stallman) • “A piece of data or content is open if anyone is free to use, reuse, and redistribute it” (OKFN) http://opendefinition.org/ • “open” (access) has multiple incompatible “definitions”. Major split is “human eyeballs” vs copying and machine “reusability” • “Open” is a marketing term for publishers, who frequently (often deliberately) do not grant full Openness. “Gratis” vs “Libre”
  48. 48. Critical Historical Open Events • Free Software Foundation (RMS, 1985) and Linux (Torvalds, 1991) • The World Wide Web (TBL, 1991) • The human genome (1990-2001) The life of Aaron Swartz (1986-2013)
  49. 49. http://www.budapestopenaccessinitiative.org/read … an unprecedented public good. … … completely free and unrestricted access to [peer-reviewed literature] by all scientists, scholars, teachers, students, and other curious minds. … …Removing access barriers to this literature will accelerate research, enrich education, share the learning of the rich with the poor and the poor with the rich, make this literature as useful as it can be, and lay the foundation for uniting humanity in a common intellectual conversation and quest for knowledge. (Budapest Open Access Initiative, 2002)
  50. 50. Panton Authors and Fellows
