Challenges and Guidelines for Reproducible Research and Interactive Education with Jupyter Notebook
Shih-Cheng Huang, Niema Moshiri, Michael Reich, Peter W. Rose
Introduction
Jupyter Notebooks have the potential to make research more
reproducible. In practice, however, many notebooks fall short of this
promise. Here we identify challenges and propose guidelines for
organizing, documenting, and deploying notebooks to increase
reproducibility and reusability. These guidelines also apply to
instructional materials.

Jupyter Notebooks are used extensively in research and education. We
recently organized a workshop at UC San Diego and invited students,
postdocs, research scientists, and faculty to share their experiences
and identify the challenges of using Jupyter Notebooks. The workshop
covered the use of Jupyter in many disciplines, including astrophysics,
bioinformatics, data science, genomics, medicine, and structural
bioinformatics, as well as classroom use with hundreds of students and
the publication of reproducible science. Here we present a summary of
our findings in the form of guidelines.
Tell a Story
•Like any good story, a notebook should have a beginning, a middle, and an end
  • Beginning
    • Introduce the topic and the aims of the notebook
  • Middle
    • Explain the steps of the workflow
    • Use markdown to split the notebook into sections
  • End
    • Interpret the results
    • What trends do the tables, plots, or figures show?
•Write the notebook for an audience
  • New users may need instructions on how to run a notebook
  • Educational materials need background and step-by-step explanations
Capture Entire Workflow
•Show how you got from the initial data to the final results
•Avoid manual steps; use a notebook for data preparation
•Split the workflow into steps, e.g., data preparation, data analysis, data visualization
•Show intermediate results so users can follow the data flow
•Challenges
  • How to combine multiple notebooks into a complex workflow
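The guideline of splitting a workflow into data preparation, data analysis, and data visualization, with intermediate results shown along the way, might look roughly like the sketch below. It is an illustration only: the inline CSV data and the function names `prepare`, `analyze`, and `visualize` are hypothetical, and a text bar stands in for a real plot. In a notebook, each step would typically live in its own cell.

```python
import csv
import io
import statistics

# Hypothetical raw data; a real notebook would load it from a documented source.
RAW_CSV = """sample,value
a,1.0
b,
c,3.0
d,5.0
"""

def prepare(raw_text):
    """Data preparation: parse the CSV text and drop rows with a missing value."""
    rows = csv.DictReader(io.StringIO(raw_text))
    return [{"sample": r["sample"], "value": float(r["value"])}
            for r in rows if r["value"]]

def analyze(records):
    """Data analysis: summarize the cleaned values."""
    values = [r["value"] for r in records]
    return {"n": len(values), "mean": statistics.mean(values)}

def visualize(summary):
    """Data visualization: a one-line text bar in place of a plot."""
    bar = "#" * round(summary["mean"])
    return f"mean of {summary['n']} samples: {bar} ({summary['mean']:.1f})"

records = prepare(RAW_CSV)
print(records)             # intermediate result: the cleaned rows
summary = analyze(records)
print(summary)             # intermediate result: the summary statistics
print(visualize(summary))  # final result
```

Printing after each step keeps the data flow visible, so a reader can see that the row with the missing value was dropped before the mean was computed.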
Avoid Copy & Paste
•Split common functions into separate files and import them into notebooks
•Challenges
  • Need tools to refactor notebooks
  • Find and extract common code among related notebooks
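As a minimal sketch of moving a shared function into a separate file and importing it, consider the example below. The module name `analysis_utils` and the function `normalize` are hypothetical, and the module is written to a temporary directory only so that the example runs anywhere; in a real project the file would sit next to the notebooks under version control.

```python
import sys
import tempfile
import textwrap
from pathlib import Path

# Write the shared module to a temporary directory (an assumption made only
# to keep this sketch self-contained).
module_dir = Path(tempfile.mkdtemp())
(module_dir / "analysis_utils.py").write_text(textwrap.dedent('''\
    def normalize(values):
        """Scale a non-empty list of numbers linearly to the range [0, 1]."""
        lo, hi = min(values), max(values)
        if hi == lo:
            return [0.0 for _ in values]
        return [(v - lo) / (hi - lo) for v in values]
'''))

# Each notebook imports the shared code instead of carrying its own copy.
sys.path.insert(0, str(module_dir))
import analysis_utils

print(analysis_utils.normalize([2, 4, 6]))  # [0.0, 0.5, 1.0]
```

A fix to `normalize` then reaches every notebook that imports it, rather than having to be pasted into each one.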
Remove Clutter
•Use markdown to organize a notebook into sections
•Split long notebooks into a series of notebooks
  • Keep a top-level notebook with links to the individual notebooks
•Avoid long cells
  • Split text and code into cells
  • One cell -> one paragraph or one task (e.g., create a plot)
•Modularize code by defining functions or classes
•Reuse code by importing functions or classes
•Put low-level documentation in code comments
•Challenges
  • Need to be able to collapse sections of text or code to hide low-level details, e.g., setup for a plot
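One way to keep a top-level notebook with links to the individual notebooks is to generate the markdown for its index cell. The sketch below assumes that the notebooks carry a numeric prefix so that sorting by name gives the workflow order; the helper `notebook_index` and the file names are hypothetical.

```python
from pathlib import Path

def notebook_index(paths, heading="Workflow overview"):
    """Build markdown for a top-level notebook cell that links to each step."""
    lines = [f"# {heading}", ""]
    for number, path in enumerate(sorted(paths), start=1):
        # Turn "01_data_preparation.ipynb" into the link text "01 data preparation".
        title = Path(path).stem.replace("_", " ")
        lines.append(f"{number}. [{title}]({path})")
    return "\n".join(lines)

index_markdown = notebook_index([
    "02_data_analysis.ipynb",
    "01_data_preparation.ipynb",
    "03_data_visualization.ipynb",
])
print(index_markdown)
```

Pasting the printed text into a markdown cell gives readers a single entry point to the series of notebooks.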
Make it Reproducible
•Specify version numbers of dependencies
•Always specify the source/location of the data
•Keep a copy of the raw data if possible
•Make copies of data in a notebook to avoid corrupting datasets when running cells out of order
•Specify random number seeds
•Before saving a notebook, rerun it to ensure linear execution order
•Challenges
  • Need functionality to freeze the current state and execution order
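Several of the guidelines above can be sketched with the Python standard library alone: recording dependency versions, fixing a random seed, working on a copy of the raw data, and checking that a saved notebook was executed top to bottom. The helper names `describe_environment` and `executed_in_order` are hypothetical, the data is a stand-in, and the in-memory dictionaries mimic only the parts of the `.ipynb` JSON that the check reads.

```python
import copy
import platform
import random
from importlib import metadata

def describe_environment(packages):
    """Report the Python version and the installed version of each package."""
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report

def executed_in_order(notebook):
    """True if the code cells carry execution counts 1, 2, 3, ... top to bottom."""
    counts = [cell.get("execution_count")
              for cell in notebook["cells"] if cell["cell_type"] == "code"]
    return counts == list(range(1, len(counts) + 1))

# Record versions at the top of the notebook (list your real dependencies).
print(describe_environment(["pip"]))

# Fix the seed and work on a copy so the raw data is never modified.
SEED = 42
rng = random.Random(SEED)
raw_data = [3, 1, 2]               # stand-in for data loaded from its source
working = copy.deepcopy(raw_data)
working.sort()
sample = rng.sample(working, 2)

# Check notebooks (as parsed from their .ipynb JSON) for linear execution.
rerun = {"cells": [{"cell_type": "code", "execution_count": 1},
                   {"cell_type": "markdown"},
                   {"cell_type": "code", "execution_count": 2}]}
out_of_order = {"cells": [{"cell_type": "code", "execution_count": 5},
                          {"cell_type": "code", "execution_count": 3}]}
print(executed_in_order(rerun), executed_in_order(out_of_order))  # True False
```

The order check is deliberately strict: it also reports `False` for a notebook that contains code cells that were never executed, so empty trailing cells should be removed before saving.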
Use Version Control
•Keep your notebooks under version control, e.g., in a Git repository on GitHub
•Describe the content of the repository in README files
•Specify a license to encourage and enable use by others
•Structure your repository
  • See, for example, http://drivendata.github.io/cookiecutter-data-science/
•Challenges
  • Each time a notebook is run, the ipynb file is modified
  • Need a "diff" tool for notebooks
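One common way to reduce the noise caused by the ipynb file changing on every run is to clear outputs and execution counts before committing. The sketch below does this on the notebook's JSON structure; the helper `strip_outputs` is hypothetical, and the small in-memory notebook stands in for a file loaded with `json.load`.

```python
import json

def strip_outputs(notebook):
    """Return a copy of a notebook dict without outputs or execution counts."""
    cleaned = json.loads(json.dumps(notebook))  # deep copy via a JSON round trip
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned

# Stand-in for an executed notebook in the version 4 notebook format.
executed = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "metadata": {}, "source": ["1 + 1"],
         "execution_count": 7,
         "outputs": [{"output_type": "execute_result", "execution_count": 7,
                      "data": {"text/plain": ["2"]}, "metadata": {}}]},
    ],
    "metadata": {}, "nbformat": 4, "nbformat_minor": 5,
}

clean = strip_outputs(executed)
print(json.dumps(clean["cells"][1], indent=1))
```

Stripped notebooks differ only where text or code changed, but they no longer show results, so this is a trade-off against sharing executed notebooks. Dedicated tools such as nbstripout and nbdime address the same problem more thoroughly.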
Share It!
•Use open source projects as your dependencies
•Add a liberal open source license (e.g., MIT, Apache 2) to your repository
•Use Nbviewer to provide static views of your executed notebook: https://nbviewer.jupyter.org/
•Use Binder to provide a zero-install environment to run your notebooks in the cloud: https://mybinder.org/
•Create a Docker image of your environment: https://docs.docker.com/
•Challenges
  • Data-intensive applications
  • Compute-intensive applications
  • Special hardware requirements, e.g., GPU
  • Multi-step workflows
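For a repository hosted on GitHub, links to Binder and Nbviewer can be added to the README. The sketch below builds such links under the assumption that both services use the URL patterns shown in the code; these patterns, the branch name `master`, and the notebook path are assumptions to verify against each service's documentation, not facts taken from the poster.

```python
from urllib.parse import quote

def binder_url(user, repo, ref="master"):
    """Assumed Binder launch URL for a GitHub repository at a given branch or tag."""
    return f"https://mybinder.org/v2/gh/{quote(user)}/{quote(repo)}/{quote(ref)}"

def nbviewer_url(user, repo, notebook, ref="master"):
    """Assumed Nbviewer URL for a static view of one notebook in a GitHub repository."""
    return ("https://nbviewer.jupyter.org/github/"
            f"{quote(user)}/{quote(repo)}/blob/{quote(ref)}/{quote(notebook)}")

# The repository is the one named on this poster; the notebook path is hypothetical.
print(binder_url("sbl-sdsc", "jupyter-guide"))
print(nbviewer_url("sbl-sdsc", "jupyter-guide", "notebooks/example.ipynb"))
```

Binder builds the environment from dependency files in the repository, so the pinned versions recommended under "Make it Reproducible" matter here as well.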
Acknowledgements
Amanda Birmingham, Ilkay Altintas, Rob Knight, Tiago Leao, Nathan Mih, Mai Nguyen,
Shweta Purawat, Brin Rosenthal, Adam Rule, Britton Smith, Shuai Tang, Guorong Xu
Image credit: 1. http://higher-ed.us/wp-content/uploads/2017/12/copy-and-paste-pictures-10-unbelievable-why-a-blogger-should-never-webmasters-nigeria.jpg, 2. https://blog.prototypr.io/meet-overflow-9b2d926b6093, 3. http://romeo.landinez.co/workflow/rapid-workflow-protoype.html, 4. https://productivitysteps.files.wordpress.com/2016/09/clutter.jpg
We are crowdsourcing a Jupyter Guide for
reproducible research. Please help us at:
https://github.com/sbl-sdsc/jupyter-guide
