Using a Jupyter Notebook to perform a reproducible scientific analysis over semantic web sources

In recent years there has been a reproducibility crisis in science. Computational notebooks, such as Jupyter, have been touted as one solution to this problem. However, when an analysis runs over live SPARQL endpoints, the answers differ depending on when the notebook was executed. In this paper, we identify some of the issues discovered while trying to develop a reproducible analysis over a collection of biomedical data sources, and suggest some best practices to overcome them.
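
One way to make this time dependence visible is to store each live answer together with the moment it was obtained. The following is a minimal sketch of that idea, not the code used in the notebook: the endpoint URL, the count query, and all function names are hypothetical placeholders. It assumes only the standard SPARQL 1.1 Protocol (GET request with a `query` parameter) and the SPARQL JSON results format.

```python
import json
import platform
import urllib.parse
import urllib.request
from datetime import datetime, timezone

# Hypothetical endpoint and query, for illustration only.
ENDPOINT = "https://example.org/sparql"
COUNT_QUERY = (
    "SELECT (COUNT(DISTINCT ?compound) AS ?count) "
    "WHERE { ?compound a ?type }"
)


def fetch_sparql_json(endpoint, query):
    """Send a SPARQL 1.1 Protocol GET request and return the JSON text."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def parse_count(results_json, variable="count"):
    """Extract one integer binding from a SPARQL JSON results document."""
    bindings = json.loads(results_json)["results"]["bindings"]
    return int(bindings[0][variable]["value"])


def timestamped_count(endpoint, query, fetch=fetch_sparql_json):
    """Run the count query and keep the execution time next to the answer."""
    return {
        "endpoint": endpoint,
        "query": query,
        "count": parse_count(fetch(endpoint, query)),
        "executed_at": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
    }
```

Passing a different `fetch` function lets the same record be built from a saved response, so a notebook could show the stored answer beside a fresh one instead of silently replacing it.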

Published in: Science


  1. Using a Jupyter Notebook to perform a reproducible scientific analysis over semantic web sources
     Alasdair J G Gray
     A.J.G.Gray@hw.ac.uk
     www.macs.hw.ac.uk/~ajg33
     @gray_alasdair
  2. Reproducibility Crisis
     Images from: https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970
     9 October 2018 www.macs.hw.ac.uk/SWeL – @hw_swel
  3. Computation Notebook
     • Literate programming: combines
       – Analysis: computation environment
       – Narrative: explanatory text
     • Cross-discipline take-up:
       – Astronomy
       – Biology
       – Oceanography
     • Gravitational Waves
       – http://mybinder.org/repo/losc-tutorial/LOSC_Event_tutorial
     https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005425
  4. Aim
     Use a computation notebook to:
     1. Perform an analysis over Semantic Web resources
        – Reproduce an analysis performed through website
        – Exploit recent Guide to Pharmacology RDF data publication and other Linked Open Data endpoints
     2. Publish the analysis for ease of reproducibility
     3. Embed semantics into the notebook
  5. Pharmacology Analysis to Reproduce
     • Using PubChem
     • Compound count in several datasets
       – ChEBI
       – ChEMBL
       – DrugBank
       – GtP
     • Intersection of compounds across datasets
       – Results reproduced 15 March 2018
  6. Developed Jupyter Notebook
  7. Analysis Results

     | Dataset               | PubChem (2018-03-15) | SPARQL (2018-06-08) | SPARQL (2018-07-24) | SPARQL (2018-10-01) | PubChem (2018-10-01) |
     |-----------------------|----------------------|---------------------|---------------------|---------------------|----------------------|
     | ChEBI                 | 91,407               | 184,393             | 90,510              | 90,510              | 92,367               |
     | ChEMBL                | 1,729,327            | 1,820,035           | 1,820,035           | 1,820,035           | 1,821,997            |
     | DrugBank              | 9,789                | 6,810               | 6,810               | 6,810               | 9,823                |
     | Guide to Pharmacology | 6,969                | 7,065               | 7,146               | 7,235               | 7,249                |
     | Intersection          | 1,523                | --                  | --                  | --                  | 1,547                |

     • PubChem
       – Receives regular updates
       – ChEMBL count doesn’t correspond to release notes
     • ChEBI RDF
       – Accessed through OLS
       – Issued: 2018-01-01
       – Double load in June?
     • ChEMBL RDF
       – Quarterly update: last release 2018-04-23
       – Count corresponds to release notes
     • DrugBank RDF
       – Last update: 2014-07-25
     • Guide to Pharmacology
       – Regular updates
     • Intersection
       – Unable to compute over RDF
  8. Jupyter Notebook Experience
     • Easy to interlace explanation and code
     • Writing style:
       – Papers tend to be formal
       – Code explanation informal
     • How to represent results at time of writing vs live results?
       – Used static table
     • Embed myBinder link
     • No referencing support (out of the box)
       – cite2c plugin: https://github.com/takluyver/cite2c
     • No standard metadata
       – Metadata not displayed
       – No markup, e.g. ORCID
     • Couldn’t include environment details
     • Generating HTML using print dialogue
       – LaTeX generation didn’t work
  9. Conclusions
     Use a computation notebook to:
     1. Perform an analysis over Semantic Web resources
        – Reproduce an analysis performed through website
        – Exploit recent Guide to Pharmacology RDF data publication and other Linked Open Data endpoints
     2. Publish the analysis for ease of reproducibility
        – https://mybinder.org/v2/gh/AlasdairGray/SemSci2018/master?filepath=SemSci2018%20Publication.ipynb
     3. Embed semantics into the notebook
     Alasdair J G Gray
     A.J.G.Gray@hw.ac.uk
     www.macs.hw.ac.uk/~ajg33
     @gray_alasdair
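
The SPARQL columns of the results table on slide 7 can be checked for sudden jumps between runs, which is one way to spot an anomaly such as the possible ChEBI double load. The sketch below is illustrative only: the counts are transcribed from that slide, while the function names and the 10% threshold are assumptions chosen for the example and are not part of the original analysis.

```python
# SPARQL endpoint counts transcribed from the results table on slide 7.
SPARQL_COUNTS = {
    "ChEBI": {"2018-06-08": 184393, "2018-07-24": 90510, "2018-10-01": 90510},
    "ChEMBL": {"2018-06-08": 1820035, "2018-07-24": 1820035, "2018-10-01": 1820035},
    "DrugBank": {"2018-06-08": 6810, "2018-07-24": 6810, "2018-10-01": 6810},
    "Guide to Pharmacology": {
        "2018-06-08": 7065,
        "2018-07-24": 7146,
        "2018-10-01": 7235,
    },
}


def relative_changes(counts_by_date):
    """Return (earlier, later, relative change) for consecutive run dates."""
    dates = sorted(counts_by_date)  # ISO 8601 dates sort chronologically
    changes = []
    for earlier, later in zip(dates, dates[1:]):
        before = counts_by_date[earlier]
        after = counts_by_date[later]
        changes.append((earlier, later, (after - before) / before))
    return changes


def flag_large_changes(all_counts, threshold=0.10):
    """List every dataset whose count moved by more than the threshold.

    The 10% default is an arbitrary assumption for this sketch.
    """
    flagged = []
    for dataset, counts in all_counts.items():
        for earlier, later, change in relative_changes(counts):
            if abs(change) > threshold:
                flagged.append((dataset, earlier, later, change))
    return flagged


if __name__ == "__main__":
    for dataset, earlier, later, change in flag_large_changes(SPARQL_COUNTS):
        print(f"{dataset}: {change:+.1%} between {earlier} and {later}")
```

With these figures, only the ChEBI drop between 2018-06-08 and 2018-07-24 exceeds the threshold, since the count roughly halves. The steady growth in Guide to Pharmacology stays well below it.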
