The ContentMine Scraping Stack: how we'll harvest 100,000,000 facts from the scientific literature

The ContentMine project (http://contentmine.org) will harvest 100 million facts from the literature. Here we summarise the technology stack we're building to enable the first step: collecting the literature.

This presentation was given with a paper (https://github.com/Blahah/scraperJSON-demo-paper) at WOSP 2014.

Published in: Data & Analytics

  1. The ContentMine Scraping Stack. Richard Smith-Unna, Peter Murray-Rust. University of Cambridge
  2. Our mission: “make 100,000,000 facts from the scholarly literature open, accessible and reusable”
  3. The scale of the task • ~27,000 peer-reviewed journals (Ulrich's) • > 5,000 publishers • new papers every day
  4. The pipeline
  5. scraperJSON • scrapers all have the same plumbing • ignore the plumbing, just configure • benefits: supports large collections of scrapers; no programming required; not limited to one piece of software
  6. Basic scraperJSON: { "name": "PLOS", "url": "plosw*.org", "elements": { "title": { "selector": "//h1[@property='dc:title']" } } } Annotations: "name" is the name of the scraper; "url" is the URL(s) it applies to; "elements" lists the elements to capture; "title" is an element name; "selector" says where to find it. http://github.com/ContentMine/scraperJSON
  7. Basic scraperJSON: { "name": "PLoS", "url": "plosw*.org", "elements": { "title": { "selector": "//h1[@property='dc:title']" } } } Annotations: "name" is the name of the scraper; "url" is the URL(s) it applies to; "elements" lists the elements to capture; "title" is an element name; "selector" says where to find it. http://github.com/ContentMine/scraperJSON
  8. bibJSON output: { "title": "Ab Initio Identification of Novel Regulatory Elements in the Genome of Trypanosoma brucei by Bayesian Inference on Sequence Segmentation" }
  9. thresher & quickscrape • reference implementation of scraperJSON • thresher is the scraping library: http://github.com/ContentMine/thresher • quickscrape is the command-line tool: http://github.com/ContentMine/quickscrape • Node.js, MIT licensed
  10. journal-scrapers: http://github.com/ContentMine/journal-scrapers, a self-testing collection of scraperJSON scrapers for academic journals • PLOS • MDPI • PeerJ • Wiley • ScienceDirect • Springer • Taylor & Francis • NPG, AAAS, RSC, ACS, …
  11. Future work • GUI (browser plugin) for creating scrapers • Standalone GUI for scraping
  12. Acknowledgements • Peter Murray-Rust • Michelle Brook • Mark MacGillivray • Emanuil Tolev • Ross Mounce • Jenny Molloy • Our volunteer community and collaborators • Funding: Shuttleworth Foundation. http://contentmine.org http://github.com/ContentMine
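
The scraperJSON idea from the slides (a declarative definition with a name, a URL pattern, and named elements located by XPath selectors) can be illustrated with a minimal sketch. This is not how thresher or quickscrape are implemented; it uses only the Python standard library, and the names `SCRAPER_JSON`, `PAGE`, and `scrape` are hypothetical. Two assumptions are worth flagging: the URL pattern `plos\w*\.org` is a stand-in chosen for this example (the pattern on the slide is kept as transcribed above), and the page is a well-formed XHTML snippet, because `xml.etree.ElementTree` only handles well-formed markup and a limited XPath subset.

```python
import json
import re
import xml.etree.ElementTree as ET

# A scraper definition in the shape shown on the slides.
# ASSUMPTION: the URL pattern below is invented for this sketch.
SCRAPER_JSON = r"""
{
  "name": "PLOS",
  "url": "plos\\w*\\.org",
  "elements": {
    "title": {"selector": "//h1[@property='dc:title']"}
  }
}
"""

# ASSUMPTION: a tiny, well-formed stand-in for a real article page.
PAGE = """<html><body>
<h1 property="dc:title">Ab Initio Identification of Novel Regulatory Elements in the Genome of Trypanosoma brucei by Bayesian Inference on Sequence Segmentation</h1>
</body></html>"""


def scrape(scraper, url, page):
    """Apply a scraper definition to one page; return a dict of captured elements.

    Returns None when the scraper's URL pattern does not match the URL.
    """
    if not re.search(scraper["url"], url):
        return None
    root = ET.fromstring(page)
    result = {}
    for name, spec in scraper["elements"].items():
        selector = spec["selector"]
        # ElementTree rejects absolute paths on an element, so make it relative.
        if selector.startswith("/"):
            selector = "." + selector
        texts = ["".join(node.itertext()).strip() for node in root.findall(selector)]
        if texts:
            # One match becomes a plain string, several become a list.
            result[name] = texts[0] if len(texts) == 1 else texts
    return result


if __name__ == "__main__":
    scraper = json.loads(SCRAPER_JSON)
    print(json.dumps(scrape(scraper, "http://www.plosone.org/article/x", PAGE), indent=2))
```

Keeping the definition as pure data is the point of the design described in the slides: the same plumbing can serve every journal, and adding a publisher means adding a JSON file rather than writing code.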
