ContentMine (TDM) at JISC Digifest

The latest developments in mining scientific documents using TheContentMine open technology

Published in: Health & Medicine

  1. 1. Content Mining (TDM). Peter Murray-Rust, ContentMine.org and University of Cambridge. JISC Digifest, Birmingham, UK, 2016-03-02. Invited and sponsored by JISC. F/OSS tools from contentmine.org. Images from Wikimedia, CC-BY-SA.
  2. 2. “The Right to Read is the Right to Mine” (Peter Murray-Rust, 2011). http://contentmine.org
  3. 3. Overview • Open semistructured documents are the most exciting underutilised knowledge resource: – Scholarly literature – Theses – Clinical trials – Government and NGO publications – Product information … • Content Mining can make huge contributions. • Europe PubMed Central (*) is the world’s best place to start. • Socio-politico-legal aspects cannot be ignored. • (*) Wellcome Trust, RCUK, FWF (Austria), Cancer Research UK, NHS UK …
  4. 4. Mining strategy • Discover: negotiate permissions => bibliography • Crawl / scrape (download) documents AND supplemental material • Normalize: PDF => XML • Index: facets => facts and snippets (“entities”) • Interpret/analyze entities => relationships, aggregations (“transformative”) • Publish
  5. 5. [Architecture diagram] CONTENTMINE: complete OPEN platform for mining scientific literature. Sources: daily crawl of EPMC, arXiv, CORE, HAL, (university repositories), ToC services; publisher sites (PLoS ONE, BMC, PeerJ…; Nature, IEEE, Elsevier…); URLs and DOIs. Formats: PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS. Tools: getpapers (query), quickscrape (crawl), norma (normalizer, structurer, semantic tagger), ami (search, lookup), catalogue. Content: text, data, figures; abstract, methods, references, captioned figures, HTML tables. Community: scrapers, queries, taggers; plugins for Chem, Phylo, Trials, Crystal, Plants. Outputs: semantic ScholarlyHTML, facts, visualization and analysis. Throughput: 30,000 pages/day.
  6. 6. Want to know about Zika? Just Type: ZIKA!
  7. 7. Semantic Fulltext • EuropePMC: coherent Open Access • getpapers: query, download (through API). • AMI filters, checks [1], transforms facts in papers. • Sequences, species, genera, genes, dictionaries. [0] All operations shown run in a total of <3 minutes. [1] Dictionaries and lookup. [2] Usable from home by anyone. Image: Zika endemic areas, Wikimedia CC-BY-SA
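The query-and-download step on this slide can be illustrated with a minimal Python sketch. It is not the getpapers implementation; it only shows the kind of request a client might send to the Europe PMC REST search service. The endpoint, the `OPEN_ACCESS:y` filter and the `resultList`/`result`/`pmcid` response fields reflect the public Europe PMC API as understood here and should be checked against its documentation. The function names are hypothetical.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumed endpoint of the Europe PMC REST search service.
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(term, open_access_only=True, page_size=100):
    """Build a search URL for `term`, optionally restricted to Open Access papers."""
    query = f"({term}) AND OPEN_ACCESS:y" if open_access_only else term
    params = {
        "query": query,
        "format": "json",
        "resultType": "lite",
        "pageSize": page_size,
    }
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"


def extract_pmcids(response_json):
    """Collect PMC identifiers from a decoded JSON search response."""
    results = response_json.get("resultList", {}).get("result", [])
    return [record["pmcid"] for record in results if "pmcid" in record]


if __name__ == "__main__":
    print(build_search_url("zika"))
```

Fetching the URL (for example with `urllib.request.urlopen`) and passing the decoded JSON to `extract_pmcids` would give the list of papers to download; paging through large result sets is left out of the sketch.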
  8. 8. Download all Open Access “Zika” from EuropePMC in 10 seconds (click below for movie) Aedes aegypti, Wikimedia CC-BY-SA Note: movies of this and other slides can be seen at https://vimeo.com/154705161
  9. 9. Downloaded all Open Access “Zika” from EuropePMC in 10 seconds Final download screen
  10. 10. Eyeballing 20/120 Zika papers, click below for movie Yellow Fever Virus Wikimedia CC-BY-SA Note: movie of this and other slides can be seen at https://vimeo.com/154705161
  11. 11. Commonest words in 120 Zika papers: virus (3011), Ae./Aedes (1939), dengue (1212), mosquito/es (901), species (894), ZIKV (791), using (721), DENV (716), detection (567), aegypti (513), infection (484), RNA (442), protein (428), albopictus (401), viral (360). Image: Mosquito spp., Wikimedia CC-BY-SA
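A word-frequency table like the one on this slide can be approximated in a few lines of Python. This is a minimal sketch, not the AMI word plugin: the stopword list is a small illustrative assumption, and variants such as "Ae." and "Aedes" are not merged as they are on the slide.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption; real tools use longer lists).
STOPWORDS = {"the", "of", "and", "in", "to", "is", "was", "were", "for", "with", "by"}


def commonest_words(texts, n=15):
    """Return the n most frequent non-stopword words across all texts."""
    counts = Counter()
    for text in texts:
        # Words of two or more letters; hyphens are allowed inside a word.
        for word in re.findall(r"[A-Za-z][A-Za-z\-]+", text):
            key = word.lower()
            if key not in STOPWORDS:
                counts[key] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    papers = ["The virus and the mosquito", "Zika virus in Aedes"]
    print(commonest_words(papers, n=3))
```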
  12. 12. Filtering local files for sequence and viruses AMI (part of ContentMine software) (click below for movie) Note: movies of this and other slides can be seen at https://vimeo.com/154705161
  13. 13. DNA primers in running text: “…the sodium channel voltage dependent gene (Nav). Primers used to amplify this fragment were AaNaA 5’-ACAATGTGGATCGCTTCCC-3’ and AaNaB 5’-TGGACAAAAGCAAGGCTAAG-3’(8). The primers amplify a fragment of approximately 472…” Snippet (quotable under the 2014 UK Statutory Instrument, “Hargreaves”): ~/PMC4654492/results/sequence/dnaprimer/results.xml W3C Annotation: [PREFIX] [MATCH] (link to target) [SUFFIX]. CMine structure: plugin / option. Image: DNA double stranded fragment, Wikimedia CC-BY-SA
  14. 14. Commonest species in 120 Zika papers: Ae./Aedes aegypti (423), Ae./Aedes albopictus (333), Ae. bromeliae (63), Ae. lilii (58), Ae. hensilli (46), Glossina pallidipes (42), Plasmodium vivax (40), Ae. luteocephalus (35), Ae. vittatus (28), Ae. furcifer (25), Plasmodium falciparum (22), Drosophila melanogaster (21). Example match: pre="fever (DHF), are caused by the world's most prevalent mosquito-borne virus. 37 DENV is carried by " exact="Aedes aegypti" post=" mosquito, which is strongly affected by ecological and human drivers, but also influenced by clima" name="binomial"/>
  15. 15. Commonest genera in Zika papers: Wolbachia (183), Aedes (70), Flavivirus/Flaviviridae (69), Glossina (30), Culex (17). Example match: pre="…-negative endosymbiotic bacterium, is a promising tool against diseases transmitted by mosquitoes. " exact="Wolbachia" post=" can be found worldwide in numerous arthropod species. More than 65% of all insect species are natu…" Image: Wolbachia in insect cell, Wikimedia CC-BY-SA
  16. 16. Commonest genes in 120 Zika papers: ITS (38), MHC2TA (20), COI (19), CYPJ92 (14), CYP6BB2 (5), CYP9J28 (4), MHC (3)
  17. 17. Microcephaly: 400/2400 papers; 2 mins. Commonest genes: MCPH1 (203), MECP2 (86), SOX2 (54), E2F1 (49), SNAP29 (47), IKBKG (40), NDE1 (40). Image: N-terminal domain of microcephalin, Wikimedia CC-BY-SA
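The prefix / match / suffix structure of such a snippet can be sketched with a regular expression. This is an illustrative approximation, not the AMI sequence plugin: the pattern (a run of at least ten A/C/G/T letters between 5’- and -3’ markers), the 40-character context window and the dictionary keys are all assumptions made for the example.

```python
import re

# Assumed pattern: 5'-SEQUENCE-3' with straight, curly or prime marks.
PRIMER = re.compile(r"5[’'′]-([ACGT]{10,})-3[’'′]")


def find_primers(text, context=40):
    """Return each primer match with the text before and after it."""
    hits = []
    for match in PRIMER.finditer(text):
        hits.append({
            "pre": text[max(0, match.start() - context):match.start()],
            "exact": match.group(0),
            "post": text[match.end():match.end() + context],
            "sequence": match.group(1),
        })
    return hits


if __name__ == "__main__":
    sentence = ("Primers used to amplify this fragment were AaNaA "
                "5’-ACAATGTGGATCGCTTCCC-3’ and AaNaB 5’-TGGACAAAAGCAAGGCTAAG-3’(8).")
    for hit in find_primers(sentence):
        print(hit["sequence"])
```

Keeping the surrounding context with each match is what makes the result usable as a short quotable snippet rather than a bare sequence.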
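Counting species while merging abbreviated and full genus names (the "Ae./Aedes" rows above) can be sketched as a dictionary lookup. This is a simplified illustration, not AMI's species plugin: the list of known species is supplied by the caller, and the rule that a genus may be shortened to its first one or two letters plus a full stop is an assumption.

```python
import re
from collections import Counter


def count_species(text, known_species):
    """Count mentions of each binomial, accepting abbreviated genus names."""
    counts = Counter()
    for name in known_species:
        genus, epithet = name.split()
        # Matches "Aedes aegypti", "Ae. aegypti" and "A. aegypti".
        pattern = re.compile(
            rf"\b(?:{re.escape(genus)}|{re.escape(genus[0])}{re.escape(genus[1])}?\.)"
            rf"\s+{re.escape(epithet)}\b"
        )
        counts[name] += len(pattern.findall(text))
    return counts


if __name__ == "__main__":
    sample = "Aedes aegypti and Ae. albopictus; later A. aegypti and Ae. aegypti."
    print(count_species(sample, ["Aedes aegypti", "Aedes albopictus"]))
```

A one-letter abbreviation is ambiguous across genera (for example, "A." could also stand for Anopheles), so a real tool would need the surrounding context to resolve it.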
  18. 18. Systematic Reviews Researchers and their machines need to “read” hundreds of papers a day or even more.
  19. 19. Polly has 20 seconds to read this paper… …and 10,000 more
  20. 20. ContentMine software can do this in a few minutes Polly: “there were 10,000 abstracts and due to time pressures, we split this between 6 researchers. It took about 2-3 days of work (working only on this) to get through ~1,600 papers each. So, at a minimum this equates to 12 days of full-time work (and would normally be done over several weeks under normal time pressures).”
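The figures in Polly's quote can be checked with simple arithmetic. The sketch below uses only numbers from the slides (10,000 abstracts, 6 researchers, 2-3 days each, and the 20 seconds per paper from the previous slide); the function names are hypothetical.

```python
def screening_effort(abstracts, researchers, days_each_low, days_each_high):
    """Return abstracts per researcher and the range of total person-days."""
    per_person = abstracts / researchers
    return per_person, researchers * days_each_low, researchers * days_each_high


def reading_hours(abstracts, seconds_each):
    """Total reading time in hours at a fixed number of seconds per abstract."""
    return abstracts * seconds_each / 3600


if __name__ == "__main__":
    per_person, low, high = screening_effort(10_000, 6, 2, 3)
    print(f"{per_person:.0f} abstracts each, {low}-{high} person-days in total")
    print(f"{reading_hours(10_000, 20):.1f} hours at 20 seconds per abstract")
```

Splitting 10,000 abstracts six ways gives roughly 1,667 each, in line with the quoted "~1,600", and 6 researchers at 2 days each gives the quoted minimum of 12 days.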
  21. 21. 400,000 clinical trials in 10 government registries. Mapping trials => papers: http://www.trialsjournal.com/content/16/1/80 2009 => 2015: what has happened in the last 6 years? Search the whole scientific literature for “2009-0100068-41”.
  22. 22. Extracting scientific information
  23. 23. Mining strategy • Discover: negotiate permissions => bibliography • Crawl / scrape (download) documents AND supplemental material • Normalize: PDF => XML • Index: facets => facts and snippets (“entities”) • Interpret/analyze entities => relationships, aggregations (“transformative”) • Publish
  24. 24. What is “Content”? http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF (CC-BY). Content types: SECTIONS, MAPS, TABLES, CHEMISTRY, TEXT, MATH. contentmine.org tackles these.
  25. 25. [Architecture diagram] CONTENTMINE: complete OPEN platform for mining scientific literature. Sources: daily crawl of EuPMC, arXiv, CORE, HAL, (university repositories), ToC services; publisher sites (PLoS ONE, BMC, PeerJ…; Nature, IEEE, Elsevier…); URLs and DOIs. Formats: PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS. Tools: getpapers (query), quickscrape (crawl), norma (normalizer, structurer, semantic tagger), ami (search, lookup), catalogue. Content: text, data, figures; abstract, methods, references, captioned figures, HTML tables. Community: scrapers, queries, taggers; plugins for Chem, Phylo, Trials, Crystal, Plants. Outputs: semantic ScholarlyHTML, facts, visualization and analysis. Throughput: 30,000 pages/day.
  26. 26. http://chemicaltagger.ch.cam.ac.uk/ • Typical chemical synthesis
  27. 27. Open Content Mining of FACTs Machines can interpret chemical reactions We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B Eur.
  28. 28. Facts in context daily IUCN endangered species news en.wikipedia.org CC By-SA
  29. 29. ContentMine Fact of The Day • Fact of the day • Endangered species in recent science • Facts • Bubbles
  30. 30. https://en.wikipedia.org/wiki/Tree_of_life CC BY-SA
  31. 31. “Root” 4500 papers each with 1 tree
  32. 32. OCR (Tesseract) and Norma (image analysis): (((((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301):131,Thermotoga_maritime:357):12,(Mycobacterium_tuberculosis:223,Bifidobacterium_longum:333):158):10,((Optiutus_terrae:441,(((Borrelia_burgdorferi:…202):91):22):32,(Proprinogenum_modestus:124,Fusobacterium_nucleatum:167):217):11):9); Semantic re-usable/computable output (ca. 4 secs/image)
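The bracketed string on this slide is in Newick tree notation, which is what makes the output computable. As a minimal sketch (not part of Norma or AMI), the leaf names and their branch lengths can be pulled out with one regular expression; the example tree is the first complete subtree of the slide's string, closed with a semicolon for illustration.

```python
import re

# A leaf is a name:length pair that directly follows "(" or ",".
# Internal nodes follow ")" and are therefore skipped.
LEAF = re.compile(r"[(,]\s*([^():,;\s]+):([0-9.]+)")


def leaf_lengths(newick):
    """Map each leaf (species) name to its branch length."""
    return {name: float(length) for name, length in LEAF.findall(newick)}


if __name__ == "__main__":
    tree = ("((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,"
            "Synergistes_jonesii:301);")
    for species, length in leaf_lengths(tree).items():
        print(species, length)
```

This handles only simple unquoted labels; building supertrees needs a full Newick parser that also keeps the nesting.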
  33. 33. Supertree for 924 species Tree
  34. 34. Supertree created from 4300 papers
  35. 35. Socio-politico-legal • TDM is one of the most complex, uncertain, confrontational and political areas of human endeavour.
  36. 36. Copyright and Mining • PMR-premise: You cannot do reproducible scientific mining and avoid violating copyright. • UK (“Hargreaves”) 2014 legislation: – “personal” “non-commercial*” “research” “data analytics” – legitimizes copying (?to disk), but not publishing *teaching, textbooks, etc. may be “commercial”
  37. 37. STM Publishers prevent Mining • FUD & disinformation about legality (Elsevier) • Monopolies on infrastructure (“API”s, CCC Rightfind) • Technical obstruction (Wiley Captcha, Macmillan Readcube) • Restrictive contracts with libraries (ALL) [1] • Wasting my/our time (ALL) [1] [You may not] utilize the TDM Output to enhance … subject repositories in a way that would [… ] have the potential to substitute and/or replicate any other existing Elsevier products, services and/or solutions.
  38. 38. WILEY … “new security feature … to prevent systematic download of content”; “[limit of] 100 papers per day”; “essential security feature … to protect both parties (sic)”. CAPTCHA: user has to type words.
  39. 39. ContentMine working with Libraries • Cambridge: Library, Plant Sciences, Epidemiology, Chemistry • Cochrane Collaboration on Systematic Reviews of Clinical Trials • FutureTDM (H2020, LIBER) • Running workshops and training
  40. 40. CM Future • Hypothes.is use ContentMine results for annotation • (with Cambridge Univ Library) extracting daily scientific facts from open and closed literature. • with EBI, Cochrane Collaborations, JISC, OKF, LIBER, TGAC/JohnInnes, DNADigest. • Running workshops, hackdays. • Planned outreach: MEPs, EC, Slashdot, Reddit, Kickstarter, geekdom • http://contentmine.org (OpenLock non-profit)
  41. 41. ContentMine working with Libraries • Cambridge: Library, Plant Sciences, Epidemiology, Chemistry • Cochrane Collaboration on Systematic Reviews of Clinical Trials • FutureTDM (H2020, LIBER) • Running workshops and training • Offers services for information extraction and indexing for born-digital documents.
  42. 42. Tractable Open Repositories • CORE • OpenAIRE • arXiv • HAL
  43. 43. “The Right to Read is the Right to Mine” (Peter Murray-Rust, 2011). http://contentmine.org
