Content Mining of Science in Europe


Talk to OpenForum Academy (Open Forum Europe) about Text and Data Mining (TDM). Four use cases selected for non-scientists. Also a discussion of the latest on European copyright reform and TDM exceptions.

Published in: Science

  1. 1. Content Mining of Science in Europe Peter Murray-Rust, University of Cambridge & Open Forum Europe OFA, Brussels, BE 2015-10-22 What is mining? Why is it useful? How YOU can do it without using publishers’ APIs. Copyright and restrictive practices are still a major problem.
  2. 2. The Right to Read is the Right to Mine* *Peter Murray-Rust, 2011
  3. 3. My European Heroes: Young People (ContentMine), NEELIE KROES
  4. 4. Use Cases of ContentMining • Epidemiology of obesity (Cambridge U) • (OKF, OpenTrials) Mapping clinical trials repositories to reports in scientific literature • Mining chemical reactions from patents • Creating a bacterial supertree-of-life from 4500 papers
  5. 5. Polly has 20 seconds to read this paper… …and 10,000 more
  6. 6. ContentMine software can do this in a few minutes Polly: “there were 10,000 abstracts and due to time pressures, we split this between 6 researchers. It took about 2-3 days of work (working only on this) to get through ~1,600 papers each. So, at a minimum this equates to 12 days of full-time work (and would normally be done over several weeks under normal time pressures).”
  7. 7. 400,000 clinical trials in 10 government registries. Mapping trials => papers, 2009 => 2015. What’s happened in the last 6 years? Search the whole scientific literature for “2009-0100068-41”
  8. 8. ContentMine-ing strategy • Discover. Crawl the COMPLETE relevant literature. => bibliography • Scrape (download). ALL papers • Index papers => Facts • Search/analyze papers => complex science • Extract, Annotate, Aggregate (“Transformative”)
  9. 9. What is “Content”? (example paper, CC-BY) SECTIONS, MAPS, TABLES, CHEMISTRY, TEXT, MATH tackles these
  10. 10. CONTENTMINE: Complete OPEN Platform for Mining Scientific Literature (architecture diagram) • Sources: daily crawl of EuPMC, arXiv, CORE, HAL, (UNIV repos), ToC services; publisher sites (PLoS ONE, BMC, PeerJ…; Nature, IEEE, Elsevier…) • Discovery: catalogue, getpapers (query), crawl, quickscrape; URLs, DOIs • Input formats: PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS • norma: Normalizer, Structurer, Semantic Tagger (text, data, figures; abstract, methods, references, captioned figures, HTML tables) => Semantic ScholarlyHTML • ami: search, lookup => Facts • CONTENT MINING COMMUNITY: plugins (Chem, Phylo, Trials, Crystal, Plants), scrapers, queries, taggers • Visualization and Analysis • 30,000 pages/day
  11. 11. Typical chemical synthesis
  12. 12. Open Content Mining of FACTs Machines can interpret chemical reactions We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B Eur.
  13. 13. Facts in context: daily IUCN endangered species news (CC BY-SA)
  14. 14. ContentMine Fact of The Day • Fact of the day • Endangered species in recent science • Facts • Bubbles
  15. 15. CC BY-SA
  16. 16. “Root” 4500 papers each with 1 tree
  17. 17. OCR (Tesseract) • Norma (image analysis) • (((((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301):131,Thermotoga_maritime:357):12,(Mycobacterium_tuberculosis:223,Bifidobacterium_longum:333):158):10,((Optiutus_terrae:441,(((Borrelia_burgdorferi:…202):91):22):32,(Proprinogenum_modestus:124,Fusobacterium_nucleatum:167):217):11):9); • Semantic re-usable/computable output (ca. 4 secs/image)
  18. 18. Supertree for 924 species Tree
  19. 19. Supertree created from 4300 papers
  20. 20. Copyright and Mining • PMR-premise: You cannot do reproducible scientific mining and avoid violating copyright. • UK (“Hargreaves”) 2014 legislation: – “personal” “non-commercial*” “research” “data analytics” – legitimizes copying (?to disk), but not publishing *teaching, textbooks, etc. may be “commercial”
  21. 21. Publishing and ICT: Trust these (Elsevier, Mendeley (Elsevier), Digital Science/Macmillan, Wiley, etc.) as much as you trust these (Microsoft, Facebook, Apple, etc.)
  22. 22. STM Publishers prevent Mining • FUD & disinformation about legality (Elsevier) • Monopolies on infrastructure (“API”s, CCC Rightfind) • Technical obstruction (Wiley Captcha, Macmillan Readcube) • Restrictive contracts with libraries (ALL) [1] • Wasting my/our time (ALL) [1] [You may not] utilize the TDM Output to enhance … subject repositories in a way that would [… ] have the potential to substitute and/or replicate any other existing Elsevier products, services and/or solutions.
  23. 23. WILEY … “new security feature… to prevent systematic download of content” “[limit of] 100 papers per day” “essential security feature … to protect both parties (sic)” CAPTCHA: user has to type words
  24. 24. ContentMine working with Libraries • Cambridge: Library, Plant Sciences, Epidemiology, Chemistry • Cochrane Collaboration on Systematic Reviews of Clinical Trials • FutureTDM (H2020, LIBER) • Running workshops and training
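
Slides 7 and 8 describe indexing a crawled corpus and then searching it for a clinical trial identifier. The "Index" and "Search" steps might look roughly like the following Python sketch, assuming the papers have already been downloaded as plain text. The `PAPERS` corpus, the paper IDs and the function names are hypothetical illustrations and are not part of the ContentMine tools.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for a crawled corpus: paper ID -> full text.
# Real input would come from the "Discover" and "Scrape" steps.
PAPERS = {
    "paper-A": "We report outcomes of trial 2009-0100068-41 in adults.",
    "paper-B": "No registry identifier is given in this report.",
    "paper-C": "Follow-up of 2009-0100068-41 after five years.",
}

# Tokens are runs of letters, digits and hyphens, so an identifier
# such as 2009-0100068-41 stays in one piece.
TOKEN = re.compile(r"[A-Za-z0-9-]+")


def build_index(papers):
    """Index step: map each lower-cased token to the papers containing it."""
    index = defaultdict(set)
    for paper_id, text in papers.items():
        for token in TOKEN.findall(text):
            index[token.lower()].add(paper_id)
    return index


def search(index, term):
    """Search step: return the sorted IDs of papers that mention the term."""
    return sorted(index.get(term.lower(), set()))


index = build_index(PAPERS)
hits = search(index, "2009-0100068-41")
print(hits)
```

Building the index once and then looking terms up is what makes it practical to check every identifier from a registry against the whole corpus, rather than rereading each paper per query.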
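
Slide 17 shows a phylogenetic tree recovered from an image as a Newick string. To illustrate why that output is "computable", here is a minimal Python parser, assuming unquoted names and optional branch lengths only. `parse_newick` and `leaf_names` are hypothetical helpers, not ContentMine code, and the example string is one complete subtree copied from the slide with a closing `;` added.

```python
def parse_newick(text):
    """Parse a Newick string into nested dicts.

    Each node is {"name": str, "length": float or None, "children": list}.
    Supports unquoted names and branch lengths only (no quotes, no comments).
    """
    text = "".join(text.split())  # drop whitespace left by line wrapping
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else ""

    def parse_node():
        nonlocal pos
        children = []
        if peek() == "(":
            pos += 1  # consume "("
            children.append(parse_node())
            while peek() == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if peek() != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
        # Optional node name, then optional ":length".
        start = pos
        while peek() and peek() not in ",():;":
            pos += 1
        name = text[start:pos]
        length = None
        if peek() == ":":
            pos += 1  # consume ":"
            start = pos
            while peek() and peek() not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if peek() != ";":
        raise ValueError("Newick string must end with ';'")
    return root


def leaf_names(node):
    """Return the names of all leaves (species) below a node, left to right."""
    if not node["children"]:
        return [node["name"]]
    names = []
    for child in node["children"]:
        names.extend(leaf_names(child))
    return names


# One complete subtree from slide 17, terminated with ";".
subtree = "(Mycobacterium_tuberculosis:223,Bifidobacterium_longum:333):158;"
tree = parse_newick(subtree)
print(leaf_names(tree))
```

Once each paper's tree is in a structure like this, species lists can be compared across papers, which is the kind of input a supertree method needs.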