Docker for data science

Docker for Data Science
Down with package managers,upwith docker
Calvin Giles- calvin.giles@gmail.com- @calvingiles

Who knows what docker is?
Who uses docker?

Who am I?
APhysicist -MPhys fromUniversity of Southampton
AData Scientist at Adthena
APyData meetupand conference co-organiser
Data Science Advisorforuntangleconsulting.io
Programming in python fornearly 10years
Using dockerfor3months

Who am I not?
Acomputerscientist
DevOps
Adockerexpert
Adockercontributor

My Problem
Imaintained a dev_setup.md document.It was 450lines long and growing.

It got worse
Ruby
Aprojectrequiringrubywasn'tsupportedbyMacPorts.iwouldhavetoinstalloutsideofmy
packagemanager.

Then Idecided to contribute to sklearn.
Followed by many build errors.
The quickest solution -disable macports.
$ git clone git://github.com/scikit-learn/scikit-learn.git
$ python setup.py build_ext --inplace

When Ire-enabled MacPorts,it was neverthe same again.

With my environment in tatters...
...and faced with re-installing fromscratch,Idecided there must be a betterway.
What about:
Homebrew
Boxen
virtualenv
anaconda
npm,rpm
vagrant,chef,puppet,ansible
VirtualBox,VMFusion
Docker,CoreOS Rocket
fig,dokku,flynn,deis
Surely one of these would help?

What do I want in a solution?
Trivialto wipe the slate clean and recreate
Portable (home laptopenv ==work laptopstate)
Easy to share
Configure once,use everywhere
Remote databases,servers etc.
Customisation (sublime,.virc,.bashrcetc.)
Installation quirks
No system-wide backuprequired
Compatible with deployment to servers
OS X-centric

Introducing Docker!!
boot2docker-a single virtualmachine running on VirtualBox(OS XorWindows)
dockerdaemon running on boot2dockeros
dockercontainers running in partialisolation inside the same boot2dockervirtual
machine
dockerclient running on the host (OS X)to simplify the issuing of commands
dockerimages as templates forcontainers
dockerimages ->.isostyle templates
dockercontainers ->lightweight virtualmachines intended to run justone process each

What do you get with docker?
Run multiple environments indipendently
Run services indipendently of environments e.g.databases
Permit an environment to interact with a specific subset of the host files
Share a poolof resources between allenvironments
Asingle containercan consume 100%of CPU,RAMand HDD
Quotas forwhen resources busy

What can be problematic
Trust -processes are given read-write access to yourfiles
stick to trusted builds and automated builds
not really different to installing any software
Resources are limited to VMallocation
Lot's to learn
Managing containers (starting,stopping etc.)

Get docker
boot2docker.dmgor.exe
apt-get install docker.io...
Initialise:
https://docs.docker.com/installation/ (https://docs.docker.com/installation/)
$ boot2docker init
$ boot2docker start
$ $(boot2docker shellinit)
$ docker login

What can you do?
Startanipythonshell:
$dockerrun-it--rmipython/ipythonipython

What can you do?
Runapythonscriptinanipythonshellwiththescipystack:
$dockerrun
-it--rm
-v$(pwd):/home
-w=/home
ipython/scipystack
ipythonmy-script.py

What can you do?
Runanotebookserver:
$dockerrun-ePASSWORD=MyPass-it--rmipython/scipyserver

What can you do?
Converta.ipynbfileintoreveal.jsslides(likethese)andservethem:
$dockerrun
-i-t--rm
-p8000:8000
-v"$(pwd)":/slides
nbreveal
/convert_and_reveal.sh'DockerforDataScience.ipynb'

What can you do?
Startacompleteenvironment:
$cd~/fig/data-science-env
$figup-d

What do all these arguments do?
-drun as daemon
-i -t --rmrun interactively and auto-remove on exit
-eset an env variable
-pmapa port like -p host:container
-vmapa host volume into the container
--linkautomatically link containers,particularly databases
-wset the working directory
--volumes-frommapallthe volumes fromthe named container

Where do containers come from?
Containers can be started using:docker run <image>.

Where do images come from?
The trusted builds on dockerhub (like ubuntu,postgres,node etc.)
Open source providers with automated builds (like ipython,julia etc.)
Public images uploaded in a built state (quite opaque)
Private images (built locally orvia docker login)

How do I build my own images?
Write a Dockerfile.
Eitherbuild and run it locally like:
Orupload it to github and have the dockerhub build it foryou automatically:
Wait forbuild...
$ docker build -t calvingiles/magic-image .
$ docker run calvingiles/magic-image
$ git push
$ docker run calvingiles/magic-image

What is this Dockerfile?
FROM ipython/scipyserver
MAINTAINER Calvin Giles <calvin.giles@gmail.com>
# Create install folder
RUN mkdir /install_files
# Install postgres libraries and python dev libraries
# so we can install psycopg2 later
RUN apt-get update
RUN apt-get install libpq-dev python-dev
# install python requirements
COPY requirements.txt /install_files/requirements.txt
RUN pip2 install -r /install_files/requirements.txt
# Set the working directory to /notebooks
WORKDIR /notebooks

Components of a Dockerfile
FROM:anotherimage to build upon (ubuntu,debian,ipython...)
RUN:execute a command in the containerand write teh results into the image
COPY:copy a file fromthe build filesystemto the image
WORKDIR:change the working directory (the containerstarts in the last WORKDIR)
ENV:set and env variable
EXPOSE:open upa port to linked containers and the host

So how do I actually use docker?
Find an image to start yourenvironment off (ubuntu,ipython/scipystack,
rocker/rstudio)
Create a Dockerfilecontaining only a FROMline:
build and run
FROM ipython/scipystack

Let's start with the ipython notebook serverwith scipystack:
Find yourboot2docker ip:
Navigate there https://your-ip:443and sign in with the PASSWORD
$ echo 'FROM ipython/scipyserver' > Dockerfile
$ docker build -t ipython-dev-env .
$ docker run -i -t --rm -e PASSWORD=MyPass -p 443:8888 ipython-dev-env
$ boot2docker ip

Install an extra python module into a notebook server
Test the installof the package you want:
In[5]: !pip3searchgensim
gensim -PythonframeworkforfastVectorSpaceModelling
In[7]: !pip3installgensim
Downloading/unpackinggensim
Downloadinggensim-0.10.3.tar.gz(3.1MB):3.1MBdownloaded
Runningsetup.py(path:/tmp/pip_build_root/gensim/setup.py)egg_infoforpackagegensim
warning:nofilesfoundmatching'*.sh'underdirectory'.'
nopreviously-includeddirectoriesfoundmatching'docs/src*'
Requirementalreadysatisfied(use--upgradetoupgrade):numpy>=1.3in/usr/local/lib/pyt
hon2.7/dist-packages(fromgensim)
Requirementalreadysatisfied(use--upgradetoupgrade):scipy>=0.7.0in/usr/local/lib/p
ython2.7/dist-packages(fromgensim)
Requirementalreadysatisfied(use--upgradetoupgrade):six>=1.2.0in/usr/lib/python2.7
/dist-packages(fromgensim)
Installingcollectedpackages:gensim
Runningsetup.pyinstallforgensim
warning:nofilesfoundmatching'*.sh'underdirectory'.'
nopreviously-includeddirectoriesfoundmatching'docs/src*'
building'gensim.models.word2vec_inner'extension
x86_64-linux-gnu-gcc-pthread-fno-strict-aliasing-DNDEBUG-g-fwrapv-O2-Wall-Wstr
ict-prototypes-fPIC-I/tmp/pip_build_root/gensim/gensim/models-I/usr/include/python2.7-
I/usr/local/lib/python2.7/dist-packages/numpy/core/include-c./gensim/models/word2vec_inn
er.c-obuild/temp.linux-x86_64-2.7/./gensim/models/word2vec_inner.o
Infileincludedfrom/usr/include/python2.7/numpy/ndarraytypes.h:1761:0,
from/usr/include/python2.7/numpy/ndarrayobject.h:17,
from/usr/include/python2.7/numpy/arrayobject.h:4,
from./gensim/models/word2vec_inner.c:232:
/usr/include/python2.7/numpy/npy_1_7_deprecated_api.h:15:2:warning:#warning"Usingd
eprecatedNumPyAPI,disableitby""#definingNPY_NO_DEPRECATED_APINPY_1_7_API_VERSION"
[-Wcpp]

In [10]: import gensim
Ifthatworks,addtheinstallcommandstoyourDockerfile:
Andrebuild:
FROMipython/scipyserver
RUNpip2installgensim
RUNpip3installgensim
$dockerbuild-tipython-dev-env.
$dockerrun-i-t--rm-ePASSWORD=MyPass-p443:8888ipython-dev-env

I want to use a requirements.txt file
Create requirement.txt
Dockerfile:
$ echo 'gensim' >> requirements.txt
COPY requirements.txt /requirements.txt
RUN pip2 install -r /requirements.txt
RUN pip3 install -r /requirements.txt

But what do I put in requirements.txt?
In[12]: !pip3freeze|head
Cython==0.20.1post0
Jinja2==2.7.2
MarkupSafe==0.18
Pillow==2.3.0
Pygments==1.6
SQLAlchemy==0.9.8
Sphinx==1.2.2
brewer2mpl==1.4.1
certifi==14.05.14
chardet==2.0.1

How do I install system libraries for MSSQL Server?
Create yourodbcinst.ini,odbc.iniand freetds.conffiles.
RUN apt-get update &&& apt-get -y install
unixodbc
unixodbc-dev
freetds-dev
tdsodbc
COPY freedts.conf >> /etc/freetds/
COPY odbcinst.ini /etc/
COPY odbc.ini /etc/

How do I install the PyODBC library from source?
RUN pip2 install https://pyodbc.googlecode.com/files/pyodbc-3.0.7.zip

How do I get a database?
You willget the IP and PORTS to connect to as env variables in the ipython-dev-env container
$ docker run -d --name dev-postgres postgres
$ docker run -d
-e PASSWORD=MyPass
-p 443:8888
--link dev-postgres:dev-postgres
ipython-dev-env

What about my data?
$ docker run -d
-v "~/Google Drive/data:/data"
--name gddata
busybox echo
$ docker run -d
-e PASSWORD=MyPass
-p 443:8888
--volumes-from gddata
ipython-dev-env

Help, I ran out of RAM
$ VBoxManage modifyvm boot2docker-vm --memory 5555
$ boot2docker stop
$ boot2docker start
$ boot2docker info { ... 'Memory': 5555 ... }

Git push?
In Dockerhub,create new and select a Automated Build.
Point it to yourgithub orbitbucket repo
Wait forthe build to complete
$ docker pull calvingiles/data-science-environment
$ docker run calvingiles/data-science-environment

I seem to be running a lot of containers
Fig can helpa lot with that.
Installfig:
Create a fig.ymlfile specifying a set of containers to start
fig up -dto begin
fig.sh/install.html(fig.sh/install.html)

Is this all really better than before?
Iusedockerfor100%ofmydatasciencetasks.
Iusedockerfornearlyeverythingelse.

MAINTAINER Calvin Giles <calvin.giles@gmail.com>
# Create install folder
RUN mkdir /install_files
# Update aptitude with new repo
RUN apt-get update
# Install software
RUN apt-get install -y git
# Make ssh dir
RUN mkdir /root/.ssh/
## Authenticate with github
# Copy over private key, and set permissions
COPY id_rsa /root/.ssh/id_rsa
RUN chmod 600 /root/.ssh/id_rsa
# Create known_hosts
RUN touch /root/.ssh/known_hosts
# Add github key
RUN ssh-keyscan github.com >> /root/.ssh/known_hosts
## install pyodbc so we can talk to MS SQL
# install unixodbc and freetds
RUN apt-get -y install unixodbc unixodbc-dev freetds-dev tdsodbc
# configure Adthena database with read-only permissions
COPY freetds.conf.suffix /install_files/freedts.conf.suffix
RUN cat /install_files/freedts.conf.suffix >> /etc/freetds/freetds.conf
COPY odbcinst.ini /etc/odbcinst.ini
COPY odbc.ini /etc/odbc.ini
# Install pyodbc from source

# install python requirements
COPY requirements.txt /install_files/requirements.txt
# Clone wayside into the docker container
RUN mkdir -p /repos/wayside
WORKDIR /repos/wayside
RUN git clone git@github.com:Adthena/wayside.git .
RUN python2 setup.py develop
RUN python3 setup.py develop
# Get rid of ssh key from image now repos have been cloned
RUN rm /root/.ssh/id_rsa
# Put the working directory back to notebooks at the end
WORKDIR /notebooks

Sum up
Find a base image
Run a containerand trialrun yourinstallsteps
Create a Dockerfileto performthose steps consistently
My environments
my public development environment -
my public dockerimages -
docker run -it --rm calvingiles/<image>
build upon with FROM calvingiles/<image>
fork (in github)if you need things a little different
github.com/calvingiles/data-science-
environment (https://github.com/calvingiles/data-science-environment)
hub.docker.com/u/calvingiles/
(https://hub.docker.com/u/calvingiles/)

Thanks
calvin.giles@gmail.com
@calvingiles

Docker for data science

Recommended

Recommended

More Related Content

What's hot

What's hot (19)

Viewers also liked

Viewers also liked (14)

Similar to Docker for data science

Similar to Docker for data science (20)

Recently uploaded

Recently uploaded (20)

Docker for data science