(presented at Big Data Spain 2016 and at Strata+HW Singapore 2016)
Project Jupyter is the evolution of IPython notebooks, applied to a range of different programming languages and environments. If you have not worked with Jupyter notebooks yet, here is a quick hands-on introduction. If you have already, this tutorial will also explore how Jupyter and Docker used together provide what Prof. Lorena Barba has called "computable content".
We will work through brief exercises that show how to use Jupyter notebooks, based on an example application for natural language processing in Python. We will use Launchbot.io to prepare containers and notebooks locally; in other words, editing on a laptop prior to working at scale using Mesos or other cluster managers. We will walk through the system architecture used at O'Reilly Media to combine Apache Mesos, Marathon, Docker, and Jupyter. Then we will take an in-depth look at how Jupyter is being used in industry, and consider its impact on data science, software engineering, and science in academia.
Project Jupyter is the evolution of IPython notebooks,
applied to a range of different programming languages
Installation and launch using Anaconda
Activate the environment needed:
source activate py3k
An example notebook (requires installs; see notes):
text = '''
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact."
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
'''

from textblob import TextBlob
blob = TextBlob(text)  # parsed text: blob.sentences, blob.noun_phrases, blob.sentiment, etc.
At its core, one can think of Jupyter as a suite
of network protocols:
Jupyter is to the remote semantics of a REPL
what HTTP is to the remote semantics of a file share
A suite of network protocols
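That analogy can be made concrete: a Jupyter frontend talks to a kernel by exchanging JSON messages over ZeroMQ sockets. Here is a minimal sketch of an execute_request message; the field names follow the Jupyter messaging spec, but the helper function itself is illustrative, not part of any library:

```python
import json
import uuid
from datetime import datetime, timezone

def execute_request(code):
    """Build a Jupyter-style execute_request message (illustrative sketch)."""
    return {
        "header": {
            "msg_id": uuid.uuid4().hex,     # unique per message
            "msg_type": "execute_request",  # tells the kernel what to do
            "session": uuid.uuid4().hex,
            "username": "demo",
            "date": datetime.now(timezone.utc).isoformat(),
            "version": "5.3",               # messaging protocol version
        },
        "parent_header": {},                # empty for a fresh request
        "metadata": {},
        "content": {
            "code": code,                   # the code the kernel will run
            "silent": False,
            "store_history": True,
        },
    }

msg = execute_request("print('hello from the kernel')")
print(json.dumps(msg["content"], indent=2))
```

The kernel replies with execute_reply and stream messages carrying the same header/content envelope, which is what lets any frontend (notebook, console, Oriole) drive any kernel.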
Jupyter @ O’Reilly Media
Embracing Jupyter Notebooks at O'Reilly
Learn alongside innovators, thought-by-thought, in context
Oriole Online Tutorials
How Do You Learn?
• A unique new medium blends code,
data, text, and video into a narrated
learning experience with computable content
• Purely browser-based UX; zero install
• Substantially higher engagement
• Opens the door for live coding
• GitHub lists over 300K public notebooks
Regex Golf by Peter Norvig
O’Reilly needed a way for authors to use Jupyter notebooks to create
professional publications. We also wanted to integrate video narration
into the UX. The result is a unique new medium called Oriole:
• Jupyter notebooks are used in the middleware
• each viewer gets a 100% HTML experience
(no download/install needed)
• context as a “unit of thought”
• the code and video are sync’ed together
• each web session has a Docker container running in the cloud
Innovators in programming, data science, dev ops, design, etc., tend to
be really busy people. Tutorials are now much quicker to publish than
"traditional" books and videos. The audience gets direct, hands-on,
contextualized experience across a wide variety of programming topics.
A notebook, a container, and ~20 minutes of
informal video walk into a bar...
Literate Programming, Don Knuth
Instead of telling computers what to do, tell other
people what you want the computers to do
Wolfram Research introduced notebooks in 1988
for working with Mathematica…
PyCon 2016 Keynote, Lorena Barba
Highly recommended: speech acts (based
on Winograd and Flores) as a theory for what
we're doing here
Tips learned by teaching with Jupyter
• focus on a concise "unit of thought"
• invest the time and editorial effort to create a good intro
• keep your narrative simple and reasonably linear
• "chunk" the text and code into understandable parts
• alternate between text, code, output, further links, etc.
• use markdown for interesting links: background, deep-dive, etc.
• code cells shouldn't be long (< 10 lines), must show output
• load data+libraries from the container, not the network
• clear all output then "Run All" – or it didn't happen
• video narratives: there's text, and there's subtext...
• pause after each "beat" – smile, breathe, let people follow you
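The "clear all output, then Run All" tip can even be checked mechanically: an .ipynb file is just JSON, so a small script can strip stale outputs before committing. A sketch using only the standard library; the cell fields follow the nbformat JSON layout, and the sample notebook dict is illustrative:

```python
import json

def clear_outputs(nb):
    """Strip outputs and execution counts from a notebook dict (nbformat JSON)."""
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return nb

# A tiny hand-built notebook dict standing in for json.load(open("nb.ipynb"))
nb = {
    "cells": [
        {"cell_type": "code", "source": "1+1",
         "outputs": [{"output_type": "execute_result"}],
         "execution_count": 3},
        {"cell_type": "markdown", "source": "# notes"},
    ],
    "nbformat": 4, "nbformat_minor": 5,
}
print(json.dumps(clear_outputs(nb)["cells"][0]["outputs"]))  # -> []
```

A notebook committed with cleared outputs forces the next reader to re-execute it top to bottom, which is exactly the repeatability the tip is after.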
For the JVM people: stop thinking only about IDEs, Ivy, Maven, etc. (ibid, Knuth1984)
BUILD UBER JARS, LOAD LIBS FROM CONTAINER, NOT THE NETWORK!
(apologies for shouting)
Jupyter notebooks + Git repos provide a low-cost,
pragmatic way toward the practice of repeatable
science – in this case, repeatable Data Science
• executable documents
• code + params + results + descriptions
• shareable insights
Notebooks: a cure for silos
In data science, we see the benefits to teams for shared
insights, storytelling, etc.
Meanwhile domain expertise is generally more important than
knowledge about tools
There’s a value for developers to use notebooks in lieu of IDEs
in some cases – what are those cases?
GitHub now renders notebooks, so they can be used for
documentation, reporting, etc.
Digital Object Identifiers (DOI) can be assigned through
Zenodo, making notebooks citable for academic publication
“Sharing is caring”
Launchbot allows a notebook author to build a
container that includes the required Jupyter kernel,
installed libraries, datasets, etc.
You need to have Docker installed on your laptop
The backend uses Git and DockerHub to manage
notebook projects and container images
For scale, deploy to DC/OS
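A container of the kind Launchbot builds might be defined like this. This is a hypothetical Dockerfile sketch; the jupyter/scipy-notebook base image (from the Jupyter Docker Stacks) is real, but the notebook name and textblob install are illustrative assumptions:

```dockerfile
# Hypothetical sketch: base image from the Jupyter Docker Stacks
FROM jupyter/scipy-notebook

# Install the libraries the notebook needs (illustrative)
RUN pip install textblob

# Bake the notebook and its data into the image, so nothing
# is loaded from the network at run time (see the tips above)
COPY example.ipynb /home/jovyan/work/
```

Baking kernel, libraries, and data into one image is what makes the same notebook run identically on a laptop and on a DC/OS cluster.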
Just Enough Math
monthly newsletter for updates,
events, conf summaries, etc.: