Docker for data science


Setting up a local docker environment for a data science workflow in order to escape the nightmare that is package managers

Published in: Data & Analytics

  1. 1. Docker for Data Science: down with package managers, up with docker. Calvin Giles - @calvingiles
  2. 2. Who knows what docker is? Who uses docker?
  3. 3. Who am I?
      - A physicist: MPhys from University of Southampton
      - A data scientist at Adthena
      - A PyData meetup and conference co-organiser
      - Data science programming in python for nearly 10 years
      - Using docker for 3 months
  4. 4. Who am I not?
      - A computer scientist
      - DevOps
      - A docker expert
      - A docker contributor
  5. 5. My Problem: I maintained a document. It was 450 lines long and growing.
  6. 6. It got worse: Ruby. A project requiring ruby wasn't supported by MacPorts. I would have to install outside of my package manager.
  7. 7. Things started to break.
  8. 8. Then I decided to contribute to sklearn. Followed by many build errors. The quickest solution: disable MacPorts.
      $ git clone git://
      $ python build_ext --inplace
  9. 9. When I re-enabled MacPorts, it was never the same again.
  10. 10. With my environment in tatters... ...and faced with re-installing from scratch, I decided there must be a better way. What about:
      - Homebrew
      - Boxen
      - virtualenv
      - anaconda
      - npm, rpm
      - vagrant, chef, puppet, ansible
      - VirtualBox, VMware Fusion
      - Docker, CoreOS Rocket
      - fig, dokku, flynn, deis
      Surely one of these would help?
  11. 11. What do I want in a solution?
      - Trivial to wipe the slate clean and recreate
      - Portable (home laptop env == work laptop state)
      - Easy to share
      - Configure once, use everywhere
      - Remote databases, servers etc.
      - Customisation (sublime, .vimrc, .bashrc etc.)
      - Installation quirks
      - No system-wide backup required
      - Compatible with deployment to servers
      - OS X-centric
  12. 12. Introducing Docker!!
      - boot2docker: a single virtual machine running on VirtualBox (OS X or Windows)
      - docker daemon running on the boot2docker OS
      - docker containers running in partial isolation inside the same boot2docker virtual machine
      - docker client running on the host (OS X) to simplify the issuing of commands
      - docker images as templates for containers
      - docker images -> .iso style templates
      - docker containers -> lightweight virtual machines intended to run just one process each
  13. 13. What do you get with docker?
      - Run multiple environments independently
      - Run services independently of environments, e.g. databases
      - Permit an environment to interact with a specific subset of the host files
      - Share a pool of resources between all environments
      - A single container can consume 100% of CPU, RAM and HDD
      - Quotas for when resources are busy
  14. 14. What can be problematic
      - Trust: processes are given read-write access to your files
        - stick to trusted builds and automated builds
        - not really different to installing any software
      - Resources are limited to VM allocation
      - Lots to learn
      - Managing containers (starting, stopping etc.)
  15. 15. Get docker
      - boot2docker .dmg or .exe
      - apt-get install
      Initialise:
      $ boot2docker init
      $ boot2docker start
      $ $(boot2docker shellinit)
      $ docker login
  16. 16. What can you do? Start an ipython shell:
      $ docker run -it --rm ipython/ipython ipython
  17. 17. What can you do? Run a python script in an ipython shell with the scipy stack:
      $ docker run -it --rm -v $(pwd):/home -w=/home ipython/scipystack
  18. 18. What can you do? Run a notebook server:
      $ docker run -e PASSWORD=MyPass -it --rm ipython/scipyserver
  19. 19. What can you do? Convert a .ipynb file into reveal.js slides (like these) and serve them:
      $ docker run -i -t --rm -p 8000:8000 -v "$(pwd)":/slides nbreveal /'DockerforDataScience.ipynb'
  20. 20. What can you do? Start a complete environment:
      $ cd ~/fig/data-science-env
      $ fig up -d
  21. 21. What do all these arguments do?
      -d                run as daemon
      -i -t --rm        run interactively and auto-remove on exit
      -e                set an env variable
      -p                map a port like -p host:container
      -v                map a host volume into the container
      --link            automatically link containers, particularly databases
      -w                set the working directory
      --volumes-from    map all the volumes from the named container
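
The flags above combine freely, so it can help to see them assembled in one place. The following is a minimal sketch, not part of the original talk: `docker_run_command` and its keyword names are hypothetical helpers that only build the argument list (nothing is executed), using the flags exactly as listed on the slide.

```python
import shlex


def docker_run_command(image, command=(), detach=False, interactive=False,
                       remove=False, env=None, ports=None, volumes=None,
                       links=None, workdir=None, volumes_from=None):
    """Assemble a `docker run` argument list from the flags on the slide."""
    args = ["docker", "run"]
    if detach:
        args.append("-d")                      # run as daemon
    if interactive:
        args += ["-i", "-t"]                   # interactive terminal
    if remove:
        args.append("--rm")                    # auto-remove on exit
    for name, value in (env or {}).items():
        args += ["-e", f"{name}={value}"]      # env variable
    for host_port, container_port in (ports or {}).items():
        args += ["-p", f"{host_port}:{container_port}"]   # host:container
    for host_path, container_path in (volumes or {}).items():
        args += ["-v", f"{host_path}:{container_path}"]   # host volume
    for name, alias in (links or {}).items():
        args += ["--link", f"{name}:{alias}"]  # link another container
    if workdir:
        args += ["-w", workdir]                # working directory
    for name in (volumes_from or []):
        args += ["--volumes-from", name]       # reuse another container's volumes
    args.append(image)
    args += list(command)
    return args


# The notebook server command from the earlier slide, rebuilt from keywords.
notebook_cmd = docker_run_command(
    "ipython/scipyserver", interactive=True, remove=True,
    env={"PASSWORD": "MyPass"}, ports={443: 8888},
)
print(shlex.join(notebook_cmd))
# docker run -i -t --rm -e PASSWORD=MyPass -p 443:8888 ipython/scipyserver
```

The list form can be handed to `subprocess.run` unchanged if you do want to execute it.
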
  22. 22. Where do containers come from? Containers can be started using: docker run <image>.
  23. 23. Where do images come from?
      - The trusted builds on Docker Hub (like ubuntu, postgres, node etc.)
      - Open source providers with automated builds (like ipython, julia etc.)
      - Public images uploaded in a built state (quite opaque)
      - Private images (built locally or via docker login)
  24. 24. How do I build my own images? Write a Dockerfile.
      Either build and run it locally like:
      $ docker build -t calvingiles/magic-image .
      $ docker run calvingiles/magic-image
      Or upload it to github and have the Docker Hub build it for you automatically:
      $ git push
      Wait for build...
      $ docker run calvingiles/magic-image
  25. 25. What is this Dockerfile?
      FROM ipython/scipyserver
      MAINTAINER Calvin Giles <>
      # Create install folder
      RUN mkdir /install_files
      # Install postgres libraries and python dev libraries
      # so we can install psycopg2 later
      RUN apt-get update
      RUN apt-get install -y libpq-dev python-dev
      # install python requirements
      COPY requirements.txt /install_files/requirements.txt
      RUN pip2 install -r /install_files/requirements.txt
      RUN pip3 install -r /install_files/requirements.txt
      # Set the working directory to /notebooks
      WORKDIR /notebooks
  26. 26. Components of a Dockerfile
      - FROM: another image to build upon (ubuntu, debian, ipython...)
      - RUN: execute a command in the container and write the results into the image
      - COPY: copy a file from the build filesystem to the image
      - WORKDIR: change the working directory (the container starts in the last WORKDIR)
      - ENV: set an env variable
      - EXPOSE: open up a port to linked containers and the host
  27. 27. So how do I actually use docker?
      - Find an image to start your environment off (ubuntu, ipython/scipystack, rocker/rstudio)
      - Create a Dockerfile containing only a FROM line:
          FROM ipython/scipystack
      - Build and run
  28. 28. Let's start with the ipython notebook server with scipystack:
      $ echo 'FROM ipython/scipyserver' > Dockerfile
      $ docker build -t ipython-dev-env .
      $ docker run -i -t --rm -e PASSWORD=MyPass -p 443:8888 ipython-dev-env
      Find your boot2docker ip:
      $ boot2docker ip
      Navigate there (https://your-ip:443) and sign in with the PASSWORD.
  29. 29. How do I build on this?
  30. 30. Install an extra python module into a notebook server. Test the install of the package you want:
      In [5]: !pip3 search gensim
      gensim - Python framework for fast Vector Space Modelling
      In [7]: !pip3 install gensim
      Downloading/unpacking gensim
      Downloading gensim-0.10.3.tar.gz (3.1MB): 3.1MB downloaded
      warning: no files found matching '*.sh' under directory '.'
      no previously-included directories found matching 'docs/src*'
      Requirement already satisfied (use --upgrade to upgrade): numpy>=1.3 in /usr/local/lib/python2.7/dist-packages (from gensim)
      Requirement already satisfied (use --upgrade to upgrade): scipy>=0.7.0 in /usr/local/lib/python2.7/dist-packages (from gensim)
      Requirement already satisfied (use --upgrade to upgrade): six>=1.2.0 in /usr/lib/python2.7/dist-packages (from gensim)
      Installing collected packages: gensim
      Running setup.py install for gensim
      warning: no files found matching '*.sh' under directory '.'
      no previously-included directories found matching 'docs/src*'
      building 'gensim.models.word2vec_inner' extension
      x86_64-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/tmp/pip_build_root/gensim/gensim/models -I/usr/include/python2.7 -I/usr/local/lib/python2.7/dist-packages/numpy/core/include -c ./gensim/models/word2vec_inner.c -o build/temp.linux-x86_64-2.7/./gensim/models/word2vec_inner.o
      In file included from /usr/include/python2.7/numpy/ndarraytypes.h:1761:0,
      from /usr/include/python2.7/numpy/ndarrayobject.h:17,
      from /usr/include/python2.7/numpy/arrayobject.h:4,
      from ./gensim/models/word2vec_inner.c:232:
      /usr/include/python2.7/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
  31. 31. In [10]: import gensim
      If that works, add the install commands to your Dockerfile:
      FROM ipython/scipyserver
      RUN pip2 install gensim
      RUN pip3 install gensim
      And rebuild:
      $ docker build -t ipython-dev-env .
      $ docker run -i -t --rm -e PASSWORD=MyPass -p 443:8888 ipython-dev-env
  32. 32. I want to use a requirements.txt file
      Create requirements.txt:
      $ echo 'gensim' >> requirements.txt
      Dockerfile:
      FROM ipython/scipyserver
      COPY requirements.txt /requirements.txt
      RUN pip2 install -r /requirements.txt
      RUN pip3 install -r /requirements.txt
  33. 33. But what do I put in requirements.txt?
      In [12]: !pip3 freeze | head
      Cython==0.20.1post0
      Jinja2==2.7.2
      MarkupSafe==0.18
      Pillow==2.3.0
      Pygments==1.6
      SQLAlchemy==0.9.8
      Sphinx==1.2.2
      brewer2mpl==1.4.1
      certifi==14.05.14
      chardet==2.0.1
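
`pip freeze` lists everything in the image, which is usually more than you want to pin. One possible approach, sketched here under the assumption that you only want the packages your project imports directly, is to filter the freeze output down to those names. `pin_requirements` is a hypothetical helper, and the sample text is the output shown on the slide.

```python
def pin_requirements(freeze_output, wanted):
    """Return the `name==version` lines of `pip freeze` output for the wanted packages."""
    wanted_lower = {name.lower() for name in wanted}
    pins = []
    for line in freeze_output.splitlines():
        line = line.strip()
        if "==" not in line:
            continue  # skip blank lines and entries without a pinned version
        name = line.split("==", 1)[0]
        if name.lower() in wanted_lower:
            pins.append(line)
    return pins


# Output of `pip3 freeze | head` as shown on the slide.
freeze_sample = """\
Cython==0.20.1post0
Jinja2==2.7.2
MarkupSafe==0.18
Pillow==2.3.0
Pygments==1.6
SQLAlchemy==0.9.8
Sphinx==1.2.2
brewer2mpl==1.4.1
certifi==14.05.14
chardet==2.0.1
"""

# Package names are matched case-insensitively; order follows the freeze output.
pins = pin_requirements(freeze_sample, ["sqlalchemy", "Pillow"])
print("\n".join(pins))
```

The printed lines can be written straight into requirements.txt.
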
  34. 34. How do I install system libraries for MSSQL Server? Create your odbcinst.ini, odbc.ini and freetds.conf files.
      RUN apt-get update && apt-get -y install unixodbc unixodbc-dev freetds-dev tdsodbc
      COPY freetds.conf /etc/freetds/
      COPY odbcinst.ini /etc/
      COPY odbc.ini /etc/
  35. 35. How do I install the PyODBC library from source?
      RUN pip2 install
      RUN pip3 install
  36. 36. How do I get a database?
      $ docker run -d --name dev-postgres postgres
      $ docker run -d -e PASSWORD=MyPass -p 443:8888 --link dev-postgres:dev-postgres ipython-dev-env
      You will get the IP and PORTS to connect to as env variables in the ipython-dev-env container.
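
Inside the notebook container, those env variables can be read from Python. The sketch below assumes the legacy link naming convention, where the alias is upper-cased with hyphens turned into underscores (for example `DEV_POSTGRES_PORT_5432_TCP_ADDR`); that naming, the helper `linked_service_address` and the sample IP are assumptions, so check the real names by printing the environment in your own container.

```python
import os


def linked_service_address(alias, port, environ=os.environ, proto="tcp"):
    """Look up the host and port that a --link alias exposes as env variables.

    Assumes variables named <ALIAS>_PORT_<port>_<PROTO>_ADDR and ..._PORT.
    """
    prefix = f"{alias.upper().replace('-', '_')}_PORT_{port}_{proto.upper()}"
    host = environ[f"{prefix}_ADDR"]
    port_number = int(environ[f"{prefix}_PORT"])
    return host, port_number


# A stand-in for the environment inside the linked container (made-up IP).
fake_env = {
    "DEV_POSTGRES_PORT_5432_TCP_ADDR": "172.17.0.5",
    "DEV_POSTGRES_PORT_5432_TCP_PORT": "5432",
}
host, port = linked_service_address("dev-postgres", 5432, environ=fake_env)
print(f"postgres is reachable at {host}:{port}")
```

Passing `environ` explicitly keeps the helper easy to try outside a container; inside one, the default `os.environ` is used.
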
  37. 37. What about my data?
      $ docker run -d -v "$HOME/Google Drive/data:/data" --name gddata busybox echo
      $ docker run -d -e PASSWORD=MyPass -p 443:8888 --volumes-from gddata ipython-dev-env
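
One detail worth knowing when mapping a host folder: the shell does not expand a leading `~` inside quotes, so the host side of `-v` should be given as an absolute path. A minimal sketch of building that argument from Python follows; `volume_arg` is a hypothetical helper, not a docker API.

```python
import os


def volume_arg(host_path, container_path):
    """Build the value for `-v`, expanding a leading ~ to an absolute host path."""
    expanded = os.path.abspath(os.path.expanduser(host_path))
    return f"{expanded}:{container_path}"


# The Google Drive folder from the slide, mapped to /data in the container.
data_volume = volume_arg("~/Google Drive/data", "/data")
print(data_volume)
```

The result can be passed as a single argument after `-v`, so the space in "Google Drive" needs no extra quoting when the command is run as an argument list.
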
  38. 38. Help, I ran out of RAM
      $ boot2docker stop
      $ VBoxManage modifyvm boot2docker-vm --memory 5555
      $ boot2docker start
      $ boot2docker info
      { ... 'Memory': 5555 ... }
  39. 39. Git push?
      - In Docker Hub, create new and select an Automated Build
      - Point it to your github or bitbucket repo
      - Wait for the build to complete
      $ docker pull calvingiles/data-science-environment
      $ docker run calvingiles/data-science-environment
  40. 40. I seem to be running a lot of containers
      - Fig can help a lot with that
      - Install fig:
      - Create a fig.yml file specifying a set of containers to start
      - fig up -d to begin
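
To make the fig.yml step concrete, here is a rough sketch that renders one from a Python dictionary. The service names `notebook` and `postgres` are hypothetical, the keys (`image`, `ports`, `environment`, `links`) are assumed to follow the fig.yml format, and the tiny renderer handles only the value shapes used here rather than YAML in general.

```python
def render_fig_yml(services):
    """Render a dict of services as fig.yml text (strings, lists and flat dicts only)."""
    lines = []
    for name, options in services.items():
        lines.append(f"{name}:")
        for key, value in options.items():
            if isinstance(value, dict):
                lines.append(f"  {key}:")
                for sub_key, sub_value in value.items():
                    lines.append(f"    {sub_key}: {sub_value}")
            elif isinstance(value, (list, tuple)):
                lines.append(f"  {key}:")
                for item in value:
                    lines.append(f'    - "{item}"')  # quote so 443:8888 stays a string
            else:
                lines.append(f"  {key}: {value}")
    return "\n".join(lines) + "\n"


# The notebook server and database from the earlier slides as two services.
services = {
    "notebook": {
        "image": "ipython-dev-env",
        "ports": ["443:8888"],
        "environment": {"PASSWORD": "MyPass"},
        "links": ["postgres"],
    },
    "postgres": {"image": "postgres"},
}
fig_yml = render_fig_yml(services)
print(fig_yml)
```

Writing the printed text to fig.yml and running `fig up -d` in the same folder is the intended use.
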
  41. 41. Is this all really better than before? I use docker for 100% of my data science tasks. I use docker for nearly everything else.
  42. 42. FROM ipython/scipyserver
      MAINTAINER Calvin Giles <>
      # Create install folder
      RUN mkdir /install_files
      # Update aptitude with new repo
      RUN apt-get update
      # Install software
      RUN apt-get install -y git
      # Make ssh dir
      RUN mkdir /root/.ssh/
      ## Authenticate with github
      # Copy over private key, and set permissions
      COPY id_rsa /root/.ssh/id_rsa
      RUN chmod 600 /root/.ssh/id_rsa
      # Create known_hosts
      RUN touch /root/.ssh/known_hosts
      # Add github key
      RUN ssh-keyscan >> /root/.ssh/known_hosts
      ## install pyodbc so we can talk to MS SQL
      # install unixodbc and freetds
      RUN apt-get -y install unixodbc unixodbc-dev freetds-dev tdsodbc
      # configure Adthena database with read-only permissions
      COPY freetds.conf.suffix /install_files/freetds.conf.suffix
      RUN cat /install_files/freetds.conf.suffix >> /etc/freetds/freetds.conf
      COPY odbcinst.ini /etc/odbcinst.ini
      COPY odbc.ini /etc/odbc.ini
      # Install pyodbc from source
      RUN pip2 install
      RUN pip3 install
  43. 43. # install python requirements
      COPY requirements.txt /install_files/requirements.txt
      RUN pip2 install -r /install_files/requirements.txt
      RUN pip3 install -r /install_files/requirements.txt
      # Clone wayside into the docker container
      RUN mkdir -p /repos/wayside
      WORKDIR /repos/wayside
      RUN git clone .
      RUN python2 develop
      RUN python3 develop
      # Get rid of ssh key from image now repos have been cloned
      RUN rm /root/.ssh/id_rsa
      # Put the working directory back to notebooks at the end
      WORKDIR /notebooks
  44. 44. Sum up
      - Find a base image
      - Run a container and trial run your install steps
      - Create a Dockerfile to perform those steps consistently
      My environments
      - my public development environment
      - my public docker images
      - docker run -it --rm calvingiles/<image>
      - build upon with FROM calvingiles/<image>
      - fork (in github) if you need things a little different
  45. 45. Thanks @calvingiles