Scientific search for everyone

The scientific and medical literature is a vast resource of knowledge, but it needs to be turned into semantic, FAIR form. ContentMine can do this, and we presented a rapid overview of its potential.

  1. 1. SES 2018, London, UK, 2018-09-03 Scientific Search for Everyone Peter Murray-Rust TheContentMine A new knowledgebase beyond journals Images from ContentMine CC BY and Wikimedia CC BY-SA pm286@cam.ac.uk peter@contentmine.org Why? How? Who?
  2. 2. (2x digital music industry!) ContentMine is OpenLocked Non-Profit http://contentmine.org The Right to Read is the Right to Mine
  3. 3. • Preprints, Unpaywall, Wikimedia, Repos, ContentMine, … offer a new generation of semantic (FAIR) science. • Closed access means people die. • Journal “publishing” divides the world • Scholarship is for all Citizens • I present a new generation of Citizen-based search tools beyond Journals
  4. 4. http://www.budapestopenaccessinitiative.org/read … an unprecedented public good. … completely free and unrestricted access to [peer-reviewed literature] by all scientists, scholars, teachers, students, and other curious minds. … Removing access barriers to this literature will accelerate research, enrich education, share the learning of the rich with the poor and the poor with the rich, make this literature as useful as it can be, and lay the foundation for uniting humanity in a common intellectual conversation and quest for knowledge. (Budapest Open Access Initiative, 2002)
  5. 5. APCs and Journals MUST GO! arXiv bioRxiv chemRxiv 10$ Commercial publisher 1800$ Review Production Hosting Corporate Branding Marketing philanthropy Shareholder Profit
  6. 6. Publisher A inviting PMR to be on EdBoard: … no waiver for Global South or developing countries. We encourage authors around the world to publish high-quality papers … (1854 USD APC) We devote some of fees to support the academic development […] , e.g., making journal travel award .. ; best author/guest editor award for … special issue; Why APCs must go Publisher B answer to PMR on EdBoard: … [We don’t review or validate data; we expect reviewers to do that]
  7. 7. [1] The Military-Industrial-Academic complex (1961) (Dwight D Eisenhower, US President) Publishers Academia Glory+? $$, MS review Taxpayer Student Researcher $$ $$ in-kind The Publisher-Academic complex[1] Infrastructure “The scholarly poor”
  8. 8. http://www.nytimes.com/2015/04/08/opinion/yes-we-were-warned-about-ebola.html We were stunned recently when we stumbled across an article by European researchers in Annals of Virology [1982]: “The results seem to indicate that Liberia has to be included in the Ebola virus endemic zone.” In the future, the authors asserted, “medical personnel in Liberian health centers should be aware of the possibility that they may come across active cases and thus be prepared to avoid nosocomial epidemics,” referring to hospital-acquired infection. Adage in public health: “The road to inaction is paved with research papers.” Bernice Dahn (chief medical officer of Liberia’s Ministry of Health) Vera Mussah (director of county health services) Cameron Nutt (Ebola response adviser to Partners in Health) A System Failure of Scholarly Publishing
  9. 9. Citizen Search (10 mins) • Tropical disease – ALL Open(“West Africa”)*(“Flavivirus”). EuropePMC – Index with disease, insect, country, insecticide • Ferromagnetism (arxiv) – 50 recent PDFs – Index magnetism, elements, crystallography
  10. 10. Also see http://freeourknowledge.org/ A recent initiative – the Fair Open Access Alliance – has developed a set of progressive criteria for journals which focus on two main objectives: regaining financial control of the academic publishing system, and supporting open access principles. In summary, these are: The journal has a transparent ownership structure, and is controlled by and responsive to the scholarly community. Authors of articles in the journal retain copyright. All articles are published open access and an explicit open access licence is used. Submission and publication is not conditional in any way on the payment of a fee from the author or its employing institution, or on membership of an institution or society. Any fees paid on behalf of the journal to publishers are low, transparent, and in proportion to the work carried out. see www.fairopenaccess.org
  11. 11. Beyond FAIR – FOAA and REACT 99% of journal articles are neither FAIR nor ethical. Fair Open Access FOAA PMR’s principles are similar Readers Equitable – unite the world Affordable Citizens Transparent
  12. 12. cc by-nc-sa license LabHack and Alliance Earth 1 APC = 1900 USD 1 bioreactor = 25 USD 1 Raspberry PI 55 USD 1 submission to bioRxiv Free (10 USD hidden) “a PCR machine in the UK is around £6000 but in Zimbabwe about $33000 - try convincing someone to pay APCs when they have to try and save for that.” CITIZENS! Zimbabwe. LabHack team from Harare Institute of Technology.
  13. 13. Guanyang Zhang • Biology, Arizona • “My ContentMine Fellowship project will focus on mining weevil-plant associations from literature records.” • “Motivation. Comprising ~70,000 described and 220,000 estimated species, weevils (Curculionoidea) are one of the most diverse plant-feeding insect lineages and constitute nearly 5% of all known animals.” • “Knowledge of host plant associations is critical for pest management, conservation, and comparative biological research. This knowledge is, however, scattered in 300 years of historical literature and difficult to access.” • Weevil-plant association network graph made with Google Fusion Table. Each blue circle is a weevil tribe and each yellow circle a plant genus. The size of a circle represents the number of associations.
  14. 14. Neo Christopher Chung • Warsaw, Computational Biology • Wants to find out geographic and temporal differences in the use of genomic software tools
  15. 15. ContentMine Workshops on Mining Chris Kittel, CM, at Mozfest 2015 Stefan Kasberger, CM
  16. 16. Julia Reda, Pirate MEP, running ContentMine software to liberate science 2016-04-16
  17. 17. Lars Willighagen • 15 years old, NL • Wants: extract data about conifers (relations to chemicals, height etc.) • Outcome: database with webpage containing conifer properties • Table Facts Visualiser DEMO • Card DEMO • Word Cloud • “I applied to this fellowship to learn new things and combine the ContentMine with two previous projects I never got to finish, and I got really excited by the idea and the ContentMine at large.”
  18. 18. ContentMine goal • Read 10,000 – 100,000 papers every day • Make them semantic (where possible) • Index against WikiData • Extract semantic objects (chemistry, computations, multivariate tables) • Publish to Zenodo • Aggregate, Filter, Mix with other domains • MAKE SCIENCE AVAILABLE You can help – if you want
  19. 19. Semantic Fulltext • EuropePMC coherent OpenAccess • getpapers: query, download (through API). • AMI filters, checks[1], transforms facts in papers. • sequences, species, genera, genes, dictionaries [0] All operations shown run in a total of <3 minutes. [1] Dictionaries and lookup. [2] Usable from home by anyone. Zika endemic areas, Wikimedia CC-BY-SA
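The query step on this slide can be illustrated with a minimal sketch that only builds a search URL for the public Europe PMC REST service. This is not the getpapers implementation; the function name `build_search_url`, the default page size, and the choice of search terms are assumptions made for illustration, and no network request is sent.

```python
from urllib.parse import urlencode

# Public Europe PMC REST search endpoint.
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(terms, open_access_only=True, page_size=100):
    """Return a Europe PMC search URL for papers matching all quoted terms.

    Hypothetical helper for illustration; not part of getpapers or AMI.
    """
    # Quote each term so multi-word phrases such as "West Africa" stay intact.
    clauses = ['"{}"'.format(term) for term in terms]
    if open_access_only:
        # Restrict hits to the open-access subset.
        clauses.append("OPEN_ACCESS:y")
    params = {
        "query": " AND ".join(clauses),
        "format": "json",
        "pageSize": page_size,
    }
    return EUROPE_PMC_SEARCH + "?" + urlencode(params)


url = build_search_url(["West Africa", "Flavivirus"])
print(url)
```

Opening the printed URL in a browser should return a JSON page of matching records; downloading and indexing the full text is a separate step that this sketch does not cover.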
  20. 20. Commonest species in 120 Zika papers: 423 Ae./Aedes aegypti; 333 Ae./Aedes albopictus; 63 Ae. bromeliae; 58 Ae. lilii; 46 Ae. hensilli; 42 Glossina pallidipes; 40 Plasmodium vivax; 35 Ae. luteocephalus; 28 Ae. vittatus; 25 Ae. furcifer; 22 Plasmodium falciparum; 21 Drosophila melanogaster. Example result: pre="fever (DHF), are caused by the world's most prevalent mosquito-borne virus. 37 DENV is carried by " exact="Aedes aegypti" post=" mosquito, which is strongly affected by ecological and human drivers, but also influenced by clima" name="binomial"/>
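A species tally like the one on this slide, where "Ae." and "Aedes" are counted together, could be approximated as follows. This is a rough sketch and not AMI's algorithm: the regular expression, the abbreviation table, and the `count_species` helper are all assumptions, and a real dictionary lookup would be considerably more careful.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary: abbreviated genus -> full genus name.
GENUS_ABBREVIATIONS = {"Ae.": "Aedes"}

# A capitalised genus (optionally ending in ".") followed by a lowercase epithet.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+\.?)\s+([a-z]{3,})\b")


def count_species(text, known_genera):
    """Count binomial names whose genus is in known_genera, expanding abbreviations."""
    counts = Counter()
    for genus, epithet in BINOMIAL.findall(text):
        genus = GENUS_ABBREVIATIONS.get(genus, genus)
        # Filtering on known genera discards ordinary capitalised words.
        if genus in known_genera:
            counts[genus + " " + epithet] += 1
    return counts


sample = (
    "DENV is carried by Aedes aegypti mosquitoes. "
    "Ae. aegypti and Ae. albopictus also transmit Zika."
)
print(count_species(sample, {"Aedes"}).most_common())
```

Filtering on a list of known genera is what keeps sentence-initial words from being counted as species; without it the pattern alone would produce many false positives.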
  21. 21. Download all Open Access “Zika” from EuropePMC in 10 seconds (click below for movie) Aedes aegypti, Wikimedia CC-BY-SA Note: movies of this and other slides can be seen at https://vimeo.com/154705161
  22. 22. Commonest words in 120 Zika papers: 3011 virus; 1939 Ae./Aedes; 1212 dengue; 901 mosquito/es; 894 species; 791 ZIKV; 721 using; 716 DENV; 567 detection; 513 aegypti; 484 infection; 442 RNA; 428 protein; 401 albopictus; 360 viral. Mosquito spp. Wikimedia CC-BY-SA
  23. 23. https://www.wikidata.org/wiki/Wikidata:WikiFactMine ContentMine thanks the Wikimedia Foundation for support 15 million articles, over 200 dictionaries
  24. 24. What is “Content”? http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY SECTIONS MAPS TABLES CHEMISTRY TEXT MATH contentmine.org tackles these
  25. 25. Automatic semantic markup of chemistry Could be used for analytical, crystallization, etc.
  26. 26. But we can now turn PDFs into Science We can’t turn a hamburger into a cow Pixel => Path => Shape => Char => Word => Para => Document => SCIENCE
  27. 27. AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY: AMI reads the complete diagram, recognizes the paths and generates the molecules. Then she creates a stop-frame animation showing how the 12 reactions lead into each other CLICK HERE FOR ANIMATION (may be browser dependent)
  28. 28. UNITS TICKS QUANTITY SCALE TITLES DATA!! 2000+ points
  29. 29. Dumb PDF CSV Semantic Spectrum 2nd Derivative Smoothing Gaussian Filter Automatic extraction
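Once the points of a plotted spectrum have been extracted to CSV, the smoothing and second-derivative steps named on this slide can be sketched in plain Python. This is an illustrative sketch only, not the code used by AMI; the function names, the edge-clamping strategy, and the default `sigma` and `radius` values are assumptions.

```python
import math


def gaussian_kernel(sigma, radius):
    """Return a normalised 1-D Gaussian kernel of length 2 * radius + 1."""
    weights = [
        math.exp(-(i * i) / (2.0 * sigma * sigma))
        for i in range(-radius, radius + 1)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(values, sigma=1.0, radius=3):
    """Gaussian-smooth a sequence, clamping indices at both ends."""
    kernel = gaussian_kernel(sigma, radius)
    last = len(values) - 1
    smoothed = []
    for i in range(len(values)):
        acc = 0.0
        for offset, weight in zip(range(-radius, radius + 1), kernel):
            # Clamp so that edge points reuse the first or last sample.
            j = min(max(i + offset, 0), last)
            acc += weight * values[j]
        smoothed.append(acc)
    return smoothed


def second_derivative(values, step=1.0):
    """Central-difference second derivative at the interior points."""
    return [
        (values[i - 1] - 2.0 * values[i] + values[i + 1]) / (step * step)
        for i in range(1, len(values) - 1)
    ]


# Toy intensities standing in for extracted spectrum points.
intensities = [0.0, 0.1, 0.5, 2.0, 5.0, 2.1, 0.4, 0.1, 0.0]
print(second_derivative(smooth(intensities)))
```

Smoothing before differentiating matters because finite differences amplify noise; a second derivative taken on raw extracted points would mostly show digitisation jitter.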
  30. 30. Modern Diagram Mining 4500 separate images Phylogenetic tree supertree A machine-compiled microbial supertree from figure-mining thousands of papers, Ross Mounce, Peter Murray-Rust, Matthew A Wills, 2017 https://riojournal.com/article/13589/
  31. 31. FAIR?
  32. 32. @Senficon (Julia Reda): Text & Data mining in times of #copyright maximalism: "Elsevier stopped me doing my research" http://onsnetwork.org/chartgerink/2015/11/16/elsevier-stopped-me-doing-my-research/ … #opencon #TDM Elsevier stopped me doing my research Chris Hartgerink
  33. 33. I am a statistician interested in detecting potentially problematic research such as data fabrication, which results in unreliable findings and can harm policy-making, confound funding decisions, and hampers research progress. To this end, I am content mining results reported in the psychology literature. Content mining the literature is a valuable avenue of investigating research questions with innovative methods. For example, our research group has written an automated program to mine research papers for errors in the reported results and found that 1/8 papers (of 30,000) contains at least one result that could directly influence the substantive conclusion [1]. In new research, I am trying to extract test results, figures, tables, and other information reported in papers throughout the majority of the psychology literature. As such, I need the research papers published in psychology that I can mine for these data. To this end, I started ‘bulk’ downloading research papers from, for instance, Sciencedirect. I was doing this for scholarly purposes and took into account potential server load by limiting the amount of papers I downloaded per minute to 9. I had no intention to redistribute the downloaded materials, had legal access to them because my university pays a subscription, and I only wanted to extract facts from these papers. Full disclosure, I downloaded approximately 30GB of data from Sciencedirect in approximately 10 days. This boils down to a server load of 0.0021GB/[min], 0.125GB/h, 3GB/day. Approximately two weeks after I started downloading psychology research papers, Elsevier notified my university that this was a violation of the access contract, that this could be considered stealing of content, and that they wanted it to stop. My librarian explicitly instructed me to stop downloading (which I did immediately), otherwise Elsevier would cut all access to Sciencedirect for my university. 
     I am now not able to mine a substantial part of the literature, and because of this Elsevier is directly hampering me in my research. [1] Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2015). The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 1–22. doi: 10.3758/s13428-015-0664-2 (Chris Hartgerink’s blog post)
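The server-load figures quoted in the blog post follow directly from the stated totals (approximately 30 GB in approximately 10 days). A short check of that arithmetic, with variable names chosen here purely for illustration:

```python
# Totals as stated in the blog post (both approximate).
total_gb = 30.0
days = 10.0

gb_per_day = total_gb / days      # 3.0
gb_per_hour = gb_per_day / 24     # 0.125
gb_per_minute = gb_per_hour / 60  # about 0.0021

print(gb_per_day, gb_per_hour, round(gb_per_minute, 4))
```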
  34. 34. All the world’s 5 million FAIR Open Scientific articles (5 million × 0.1 MB = 0.5 TB), indexed by ContentMine. Disk: 30 GBP. Raspberry Pi 3: 50 GBP. CC BY, PeterMR. Disk, Raspberry Pi, Power
  35. 35. bioRxiv in Citizen Health Search (CHS) A proposal to Wellcome Trust (Open Research in Health call) with ContentMine, Cochrane and UCL-EPPI (CCU). CHS puts semantic search on the desktop of the searcher. We index all the visible medical literature, then normalize, section and index it against a bank of user-chosen dictionaries. CHS takes input from EPMC, bioRxiv and emerging community sources such as Crossref and Unpaywall, and outputs to Zenodo, Wikidata and CM-Science Source. Citizen Dashboard
  36. 36. Gene-species co-occurrence in Marchantia from bR. PDF Semantic Sectioned Indexed HTML Dashboard West Africa Flavivirus 600 hits download PDF->SVG->HTML Dictionary etc. search Species-gene cooccurrence Automatic! Gene, country, organiz, plantparts, species Synoptic view Author,abstract, kword, abbrev, intro
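The species-gene co-occurrence view on this slide can be approximated by counting, per document, which dictionary terms appear together. The sketch below is an assumption-laden illustration rather than the dashboard's actual method: `cooccurrence` is a hypothetical helper, the gene names are placeholders, and plain case-insensitive substring matching stands in for real dictionary lookup.

```python
from collections import Counter
from itertools import product


def cooccurrence(documents, species_terms, gene_terms):
    """Count documents in which a species term and a gene term both appear."""
    counts = Counter()
    for text in documents:
        lowered = text.lower()
        # Naive substring matching; a real index would tokenise and normalise.
        species_hits = [s for s in species_terms if s.lower() in lowered]
        gene_hits = [g for g in gene_terms if g.lower() in lowered]
        for pair in product(species_hits, gene_hits):
            counts[pair] += 1
    return counts


# Placeholder documents and gene names, invented for this example.
docs = [
    "Marchantia polymorpha expresses geneA and geneB.",
    "geneA was also found in Marchantia polymorpha.",
    "Arabidopsis thaliana lacks geneB.",
]
table = cooccurrence(
    docs,
    ["Marchantia polymorpha", "Arabidopsis thaliana"],
    ["geneA", "geneB"],
)
print(table.most_common())
```

Note that the third placeholder document still counts as a co-occurrence even though it reports an absence, which is a known limitation of counting terms without reading the sentence.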
  37. 37. Results of searching for “ferromagnetism” on arxiv 201806-201808 And > 100 more arxiv compchem country crystal element orgs magnetism Bag of words
  38. 38. Results from “Zika+WestAfrica”
  39. 39. “Madeira” “popular tourist destination” “Insecticide Resistance” “Flower pots”
  40. 40. Question • “How can I help?”