ContentMining in Neuroscience

How content mining, especially of diagrams, can help neuroscientists read the literature effectively

Published in: Science

  1. 1. Open Mining of the Bioscience Literature. Peter Murray-Rust, ContentMine.org and the University of Cambridge. UNAM, MX, 2015-10-09. Millions of data points are hidden in the bioscience literature. ContentMine has Open technology to liberate them automatically, using OpenNotebook approaches. The major problem is politico-legal. This is an exploratory talk, looking for ideas and projects. The future depends on young people.
  2. 2. Oxford 2013 Berlin 2014 Delhi 2014 Jenny Molloy with mascot AMI
  3. 3. Panton Authors and Fellows
  4. 4. Some particularly relevant Fellows/Alumni and projects: • Rufus Pollock: Open Knowledge Foundation • Mark Surman: Mozilla • Dan Whaley: Hypothes.is • Daniel Lombrana-Gonzales: PyBossa/Crowdcrafting Erin McKiernan, 2015 Flash Award ContentMine and Peter Murray-Rust are funded by:
  5. 5. The Right to Read is the Right to Mine http://contentmine.org
  6. 6. ContentMine Workshops and Hackdays Open Science Brazil, 2014-08 Easily distributed software Get started in 30 mins Build application in a morning Start simple: bagOfWords, Stemming, Regex, templates
  7. 7. Typical scientific paper
  8. 8. Why do we publish science? • Communicate our results • Archival • Get feedback from peers. • Provide material that others can re-use. • Priority and esteem.
  9. 9. http://www.nytimes.com/2015/04/08/opinion/yes-we-were-warned-about-ebola.html [Liberian Ministry of Health] were stunned recently when we stumbled across an article by European researchers in Annals of Virology [1982]: “The results seem to indicate that Liberia has to be included in the Ebola virus endemic zone.” In the future, the authors asserted, “medical personnel in Liberian health centers should be aware of the possibility that they may come across active cases and thus be prepared to avoid nosocomial epidemics,” referring to hospital-acquired infection. Adage in public health: “The road to inaction is paved with research papers.” Bernice Dahn (chief medical officer of Liberia’s Ministry of Health), Vera Mussah (director of county health services), Cameron Nutt (Ebola response adviser to Partners in Health). A System Failure of Scholarly Publishing
  10. 10. Re-use You cannot assume how others will want to re-use your work.
  11. 11. PM-R’s “first real paper”, doing science by re-using the results of others in a novel way
  12. 12. 1974: Each point is a separate paper! Each needed 1-4 hours in the library: discovery, hardcopy delivery, transcription, hand calculation.
  13. 13. 1976-9: PMR and WDSM developed software and protocols to search and analyze the Cambridge Crystallographic DB
  14. 14. We need machines to read the literature
  15. 15. Output of scholarly publishing [2] https://en.wikipedia.org/wiki/Mont_Blanc#/media/File:Mont_Blanc_depuis_Valmorel.jpg 586,364 Crossref DOIs 201,507 [1] per month 1.5 million (papers + supplemental data) /year [citation needed]* each 3 mm thick → 4500 m high per year [2] * Most is not Publicly readable [1] http://www.crossref.org/01company/crossref_indicators.html
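The stack-height figure on this slide is simple arithmetic. A quick check, taking the slide's own inputs (1.5 million items per year, 3 mm each) at face value:

```python
# Both inputs come from the slide; the 3 mm thickness is the slide's own estimate.
ITEMS_PER_YEAR = 1_500_000
THICKNESS_MM = 3


def stack_height_m(items: int, thickness_mm: float) -> float:
    """Height in metres of a stack of `items` documents of equal thickness."""
    return items * thickness_mm / 1000


height = stack_height_m(ITEMS_PER_YEAR, THICKNESS_MM)
print(f"{height:.0f} m per year")
```

This reproduces the 4500 m on the slide, which is why the slide pictures Mont Blanc for scale.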
  16. 16. Scientific and Medical publication (STM)[+] • World Citizens pay $450,000,000,000… • … for research in 1,500,000 articles … • … cost $300,000 each to create … • … $7000 each to “publish” [*]… • … $10,000,000,000 from academic libraries … • … to “publishers” who forbid access to 99.9% of citizens of the world … • 85% of medical research is wasted (not published, badly conceived, duplicated, …) [Lancet 2009] [+] Figures probably ±50% [*] arXiv preprint server costs $7 USD per paper
  17. 17. What is “Content”? http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY SECTIONS MAPS TABLES CHEMISTRY TEXT MATH contentmine.org tackles these
  18. 18. ContentMine approaches 0. Open software, Open content, Open notebooks 1. Daily liberation of facts which are easy and widely useful. – Species (Bacillus subtilis, Okapia johnstoni) – Genes (BRCA1*, APOE) – Chemicals (acetone, CH3OH) – Identifiers (RRIDs, museum specimens) 2. CMunities of practice with bespoke tools: – Clinical Trials – Phylogenetic trees – Systematic reviews
  19. 19. http://chemicaltagger.ch.cam.ac.uk/ • Typical chemical synthesis
  20. 20. Open Content Mining of FACTs Machines can interpret chemical reactions We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B Eur.
  21. 21. C) What’s the problem with this spectrum? Org. Lett., 2011, 13 (15), pp 4084–4087 Original thanks to ChemBark
  22. 22. After AMI2 processing… AMI2 has detected a square
  23. 23. DAILY CONTENT MINING [pipeline diagram]. Sources (daily crawl): EuPMC, arXiv, CORE, HAL, (UNIV repos), ToC services, publisher sites (PloSONE, BMC, peerJ… Nature, IEEE, Elsevier…). Inputs: URLs, DOIs; PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS. Tools: catalogue, getpapers (query), quickscrape (crawl, scrapers), norma (Normalizer, Structurer, Semantic Tagger), ami (search, Lookup, taggers). Content: Text, Data, Figures; abstract, methods, references, captioned figures (Fig. 1), HTML tables. COMMUNITY plugins: Chem, Phylo, Trials, Crystal, Plants. Outputs: Semantic ScholarlyHTML, Facts, Visualization and Analysis. Throughput: 30,000 pages/day.
  24. 24. http://opentrials.net/ ContentMine will work with OpenTrials
  25. 25. “adult nonpregnant patients, aged ≥18 years”, “randomization sequence using a permuted block design with random block sizes stratified by study center”. “blinding of the patients and caregivers is not possible”. “Investigators performing analysis are blinded for the intervention”. “Continuous normally distributed variables … mean and standard deviation, counts (n) and percentages (%). … Student’s t-test … or the Mann–Whitney U test … Categorical … Chi-square test or Fisher's exact tests. Statistical significance is considered to be at a P value <0.05 …” Formulaic language in reporting clinical trials
  26. 26. Text-based plugins • Bag of words (https://en.wikipedia.org/wiki/Bag-of-words_model) • Term frequency, inverse document frequency (https://en.wikipedia.org/wiki/Tf%E2%80%93idf) • Templates and regexes (regular expressions).
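The first two plugin ideas on this slide can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not ContentMine's plugin code: the function names and the three sample sentences are made up, and the tf-idf weighting used here (raw term frequency times `log(N / df)`) is only one of several common variants.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text, keep runs of letters, and count each word."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def tf_idf(docs):
    """Return one {word: score} dict per document.

    tf  = occurrences of the word in the document / words in the document
    idf = log(number of documents / number of documents containing the word)
    """
    bags = [bag_of_words(doc) for doc in docs]
    n_docs = len(bags)
    # Each bag contributes each of its words once, so this counts documents.
    doc_freq = Counter(word for bag in bags for word in bag)
    scores = []
    for bag in bags:
        total = sum(bag.values())
        scores.append(
            {w: (c / total) * math.log(n_docs / doc_freq[w]) for w, c in bag.items()}
        )
    return scores


# Made-up sentences standing in for full-text articles.
docs = [
    "patients were randomized to treatment",
    "patients received placebo",
    "mice were randomized to cages",
]
scores = tf_idf(docs)
```

Words that occur in every document score zero, so the ranking favours terms that distinguish one paper from the rest of the collection.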
  27. 27. “Bag of Words” Three fulltext articles from trialsjournal.com
  28. 28. Regular Expressions for Systematic Reviews of Animal Tests [table columns: Preceding Text, Following Text, Extracted term]. In 30 minutes, 6 scientists (most were unfamiliar with regex) wrote 200 regexes for ARRIVE (NC3Rs guidelines)
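The table columns on this slide (preceding text, extracted term, following text) map naturally onto named groups in a regular expression. The sketch below is hypothetical: the pattern, the group names and the sample sentence are invented for illustration and are not taken from the 200 regexes written at the workshop.

```python
import re

# Hypothetical pattern: a count of animals (the extracted term), anchored by
# a short preceding phrase and a following animal noun.
PATTERN = re.compile(
    r"(?P<pre>\b(?:total of|using|in)\s+)"
    r"(?P<term>\d+)"
    r"(?P<post>\s+(?:(?:male|female|adult)\s+)*(?:mice|rats)\b)",
    re.IGNORECASE,
)


def extract(text):
    """Return (preceding text, extracted term, following text) for each match."""
    return [
        (m.group("pre").strip(), m.group("term"), m.group("post").strip())
        for m in PATTERN.finditer(text)
    ]


sentence = "A total of 24 adult male mice were used; in 12 rats the dose was doubled."
rows = extract(sentence)
```

Keeping the context in its own groups makes each hit easy to review by eye, which matters when the people writing the patterns are new to regex.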
  29. 29. TEMPLATES
  30. 30. https://en.wikipedia.org/wiki/Consolidated_Standards_of_Reporting_Trials Some communities have standard Reporting, which helps extraction
  31. 31. Bitmap Image and Tesseract OCR [text recovered from plot: y-axis “Ln Bacterial load per fly”, ticks 11.5 11.0 10.5 10.0 9.5 9.0 6.5 6.0; x-axis “Days post-infection”, ticks 0 1 2 3 4 5]
  32. 32. UNITS TICKS QUANTITY SCALE TITLES DATA!! 2000+ points
  33. 33. Dumb PDF CSV Semantic Spectrum 2nd Derivative Smoothing Gaussian Filter Automatic extraction
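The slide names Gaussian filtering and a second derivative as steps in turning an extracted spectrum into something semantic. Below is a minimal pure-Python sketch of those two operations, assuming evenly spaced samples. The function names, the 3-sigma kernel radius and the edge clamping are illustrative choices, not the AMI implementation.

```python
import math


def gaussian_kernel(sigma):
    """Normalised Gaussian weights out to about 3 sigma on each side."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(values, sigma=1.0):
    """Convolve with a Gaussian kernel, repeating the end values at the edges."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp index into range
            acc += w * values[j]
        out.append(acc)
    return out


def second_derivative(values):
    """Central second difference for unit spacing; the two endpoints are dropped."""
    return [values[i - 1] - 2 * values[i] + values[i + 1] for i in range(1, len(values) - 1)]
```

Smoothing first matters because differencing amplifies pixel-level noise; strongly negative values of the second derivative then mark candidate peaks.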
  34. 34. PLUTo
  35. 35. Aves Apterygidae Marsupialia Monotremata Mammalia Reptilia Amphibia Arthropoda Myriapodia Okapia johnstoni Pyrus Stuffed Tree of Life
  36. 36. https://blogs.ch.cam.ac.uk/pmr/2014/06/25/content-mining-we-can-now-mine-images-of-phylogenetic-trees-and-more/ for the story of extraction: Thinning, Topology, Serialization, Newick
  37. 37. PMR’s Tribute Planned Memorial Meeting July 14th 2014 Cambridge OPEN NOTEBOOK SCIENCE
  38. 38. Traditional Research and Publication “Lab” work paper/thesis Write rewrite Re-experiment publish ??? Validation?? DATA output “belongs” to publisher
  39. 39. TOOLS Open Notebook Science Open engineered repository World community INSTRUMENT validate merge MODEL CODE DATA DATA knowledge calibrate Problems are solved communally; nothing is needlessly duplicated; “publication” is continuous; data are SEMANTIC. Machines and humans working together
  40. 40. Open Notebook Content Mining • “No insider knowledge” • Anyone can become involved • All raw non-copyright material on Github • Planning and discussion on Open Discourse • All output (however imperfect) on Github CC0 • Immediate upload • Inspired by Free/Libre/Open Source, Wikipedia, OpenStreetMap.
  41. 41. 4300 images
  42. 42. “Root”
  43. 43. OCR (Tesseract), Norma (image analysis): (((((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301):131,Thermotoga_maritime:357):12,(Mycobacterium_tuberculosis:223,Bifidobacterium_longum:333):158):10,((Optiutus_terrae:441,(((Borrelia_burgdorferi:…202):91):22):32,(Proprinogenum_modestus:124,Fusobacterium_nucleatum:167):217):11):9); Semantic re-usable/computable output (ca. 4 secs/image)
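The Newick string on this slide is "computable" in the sense that a few lines of code can turn it back into a tree. The parser below is a minimal sketch covering only the subset of Newick seen here (nested parentheses, unquoted names, optional `:length`). The function names are made up, and the sample tree is just the first complete clade from the slide, because the slide's full string contains an elision.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts (name, length, children)."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if text[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return root


def leaf_names(node):
    """Collect the names of all leaves, left to right."""
    if not node["children"]:
        return [node["name"]]
    return [name for child in node["children"] for name in leaf_names(child)]


# First complete clade of the slide's tree, closed off as a tree of its own.
tree = parse_newick(
    "((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301);"
)
leaves = leaf_names(tree)
```

The branch lengths here are whatever the image analysis measured, so they are only comparable within one extracted figure.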
  44. 44. Automatic Open Notebook of computations Everything is posted to Github before being analyzed
  45. 45. Bacillus subtilis [131238]* Bacteroides fragilis [221817] Brevibacillus brevis Cyclobacterium marinum Escherichia coli [25419] Filobacillus milosensis Flectobacillus major [15809775] Flexibacter flexilis [15809789] Formosa algae Gelidibacter algens [16982233] Halobacillus halophilus Lentibacillus salicampi [18345921] Octadecabacter arcticus Psychroflexus torquis [16988834] Pseudomonas aeruginosa [31856] Sagittula stellata [16992371] Salegentibacter salegens Sphingobacterium spiritivorum Terrabacter tumescens • [Identifier in Wikidata] • Missing = not found with Wikidata API 20 commonest organisms (in > 30 papers) in trees from IJSEM* Half do not appear to be in Wikidata Can the Wikipedia Scientists comment? *Int. J. Syst. Evol. Microbiol.
  46. 46. Supertree for 924 species Tree
  47. 47. Supertree created from 4300 papers
  48. 48. Minor branch
  49. 49. Part of major branch
  50. 50. Part of major branch
  51. 51. Ideas for Neuroscience: Can we extract digital information from published electroneurophysiology traces… and build super-information?
  52. 52. Raw trace (pixels)
  53. 53. Thinned trace (pixels)
  54. 54. Line segments (SVG)
  55. 55. Reconstructed trace (SVG)
  56. 56. Extraction into data format (CSV, Excel)
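The last step in this sequence, going from a reconstructed trace in pixel coordinates to a data file, can be sketched as an axis calibration followed by a CSV write. Everything specific below is an assumption for illustration: linear axes, two known tick positions per axis, and invented calibration values, column names and polyline points.

```python
import csv
import io


def make_axis_map(pixel_a, value_a, pixel_b, value_b):
    """Linear map from a pixel coordinate to a data value, fixed by two known ticks."""
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale


def trace_to_csv(points, x_map, y_map, header=("x", "y")):
    """Convert (pixel_x, pixel_y) points to data units and return CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    for px, py in points:
        writer.writerow([round(x_map(px), 6), round(y_map(py), 6)])
    return buffer.getvalue()


# Hypothetical calibration: x pixels 100 and 500 are the 0 ms and 200 ms ticks;
# y pixels 400 and 100 are the -80 mV and +40 mV ticks (image y grows downwards).
x_map = make_axis_map(100, 0.0, 500, 200.0)
y_map = make_axis_map(400, -80.0, 100, 40.0)
polyline = [(100, 400), (300, 250), (500, 100)]
csv_text = trace_to_csv(polyline, x_map, y_map, header=("time_ms", "voltage_mV"))
```

In practice the tick positions and labels would come from the OCR and axis-detection steps shown earlier in the talk; a logarithmic axis would need a different map.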
  57. 57. CONTENT MINING [pipeline diagram]. Sources (daily crawl): EuPMC, arXiv, CORE, HAL, (UNIV repos), ToC services, publisher sites (PloSONE, BMC, peerJ… Nature, IEEE, Elsevier…). Inputs: URLs, DOIs; PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS. Tools: catalogue, getpapers (query), quickscrape (crawl, scrapers), norma (Normalizer, Structurer, Semantic Tagger), ami (search, Lookup, taggers). Content: Text, Data, Figures; abstract, methods, references, captioned figures (Fig. 1), HTML tables. COMMUNITY plugins: Chem, Phylo, Trials, Crystal, Plants. Outputs: Semantic ScholarlyHTML, Facts, Visualization and Analysis. Throughput: 30,000 pages/day.
  58. 58. Peter Murray-Rust BMC publisher Blue Obelisk paper (20 co-authors) Sub-network From CATalog
  59. 59. Phytochemistry extraction O. dayi “volatile composition of “ A.sibeiri A. judaica Displayed by CAT (CottageLabs)
  60. 60. The problem ©
  61. 61. Prof. Ian Hargreaves (2011): “David Cameron's exam question”: “Could it be true that laws designed more than three centuries ago with the express purpose of creating economic incentives for innovation by protecting creators' rights are today obstructing innovation and economic growth?” “Yes. We have found that the UK's intellectual property framework, especially with regard to copyright, is falling behind what is needed.” "Digital Opportunity" by Prof Ian Hargreaves - http://www.ipo.gov.uk/ipreview.htm. Licensed under CC BY 3.0 via Wikipedia - https://en.wikipedia.org/wiki/File:Digital_Opportunity.jpg#/media/File:Digital_Opportunity.jpg
  62. 62. Elsevier wants to control Open Data [asked by Michelle Brook]
  63. 63. http://www.epip2015.org/copyright-wars-frozen-conflict/ UPDATE 20150902: Ian Hargreaves "the voices of the digital many should not be drowned out by the digital self-interested few"
  64. 64. contentmine.org team
