Ferry - Share and Deploy Big Data Applications with Docker by James Horey PyData SV 2014


Ferry is a Python-based, open-source tool that helps developers share and run big data applications. Users can provision Hadoop, Cassandra, GlusterFS, and Open MPI clusters on their local machine using YAML and then distribute their applications via Dockerfiles. These capabilities are useful for data scientists experimenting with big data technologies, for developers who need an accessible big data development environment, and for developers who simply want to share their big data applications. In this presentation, I’ll introduce you to Docker, show you how to create a simple big data application in Ferry, and discuss ways the Python community can contribute to the open-source project. I’ll also discuss future directions for Ferry, with a focus on better application sharing and operational deployments.

Published in: Technology, Business

  1. 1. Ferry - Share & Deploy Big Data Applications with Docker James Horey
  2. 2. • Writing a simple application with Bokeh • Packaging our application with Docker • Orchestrating our application with Ferry Technical material can be found at: https://github.com/jhorey/pydata
  3. 3. Bokeh
  4. 4. U.S. Census http://api.census.gov/data/2011/acs5?get=DP03_0062E&for=county:*&in=state:06 Median income All counties California
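The Census query on slide 4 can be sketched in Python roughly as follows. This is a minimal sketch, not the code from the talk: the helper names are hypothetical, the response layout (a JSON list of rows whose first row is the header) is an assumption about the Census API, and the sample values are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the slide (2011 ACS 5-year data).
BASE_URL = "http://api.census.gov/data/2011/acs5"


def build_census_url(variable, state_fips):
    """Build the query URL for one variable across all counties of a state."""
    params = {"get": variable, "for": "county:*", "in": "state:" + state_fips}
    # Keep ':' and '*' unescaped so the URL matches the form shown on the slide.
    return BASE_URL + "?" + urlencode(params, safe=":*")


def fetch_census_rows(url):
    """Download and decode the response (needs network access; not called here)."""
    with urlopen(url) as response:
        return json.load(response)


def parse_census_rows(rows):
    """Turn [header, row, row, ...] into a list of dicts keyed by the header."""
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


# Made-up sample in the assumed response layout; the numbers are not real data.
sample_rows = [
    ["DP03_0062E", "state", "county"],
    ["50000", "06", "001"],
    ["60000", "06", "003"],
]
url = build_census_url("DP03_0062E", "06")
records = parse_census_rows(sample_rows)
print(url)
print(records[0])
```

Keeping the download separate from the parsing means the parsing step can be tried on a saved response without touching the network.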
  5. 5. Download some data
  6. 6. Let’s install Bokeh
        $ pip install bokeh
        >> Downloading/unpacking bokeh
        >> SystemError: Cannot compile 'Python.h'. Perhaps you need to install python-dev|python-devel.
        $ apt-get install python-dev && pip install bokeh
        >> gcc: error trying to exec 'cc1plus': execvp: No such file or directory
        $ apt-get install g++
        $ pip install bokeh
        RuntimeError: bokeh sample data directory does not exist, please execute bokeh.sampledata.download()
        $ python
        >>> import bokeh.sampledata
  7. 7. A simple application $ python plot.py Kentucky Louisville
  8. 8. Let’s share
        #!/bin/bash
        # Make sure we have 'pip' installed
        apt-get install python-pip
        # Install packages in right order
        apt-get --yes install g++ python-dev
        pip install bokeh
        # Now download the data
        python geography.py data/
        python population economic Kentucky data/
        # Start the web server
        python webserver data/
        • Your script didn’t work
        • Oh, I was supposed to run this as sudo?
        • Ok, it still didn’t work
        • I get this funny error
        • Oh yeah, I’m running Redhat
        • Ok I’m at my desk, just use my computer
  9. 9. • Encapsulates applications in isolated containers • Makes it easy and safe to distribute applications • Easy to get started
  10. 10. Our Dockerfile Start from a clean Precise image Install stuff Add our files Run this when starting $ docker build -t ferry/pydata . $ docker push ferry/pydata
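The four Dockerfile steps named on slide 10 (start from a clean Precise image, install, add files, run on start) can be sketched as a small Python helper that renders the file. This is a sketch under stated assumptions, not the Dockerfile from the talk: `render_dockerfile` is a hypothetical helper, `ubuntu:12.04` is assumed as the Precise base image, the package list is borrowed from the install script on slide 8, and the paths follow the process listing on slide 17.

```python
import json


def render_dockerfile(base_image, apt_packages, pip_packages,
                      source_dir, target_dir, command):
    """Render a Dockerfile that mirrors the four steps on the slide."""
    lines = [
        # 1. Start from a clean base image.
        "FROM " + base_image,
        # 2. Install system and Python packages.
        "RUN apt-get update && apt-get --yes install " + " ".join(apt_packages),
        "RUN pip install " + " ".join(pip_packages),
        # 3. Add our files to the image.
        "ADD {} {}".format(source_dir, target_dir),
        # 4. Run this when the container starts (exec form is a JSON array).
        "CMD " + json.dumps(command),
    ]
    return "\n".join(lines) + "\n"


# All values below are assumptions chosen to match other slides.
dockerfile_text = render_dockerfile(
    base_image="ubuntu:12.04",
    apt_packages=["python-pip", "g++", "python-dev"],
    pip_packages=["bokeh"],
    source_dir=".",
    target_dir="/home/ferry/pydata",
    command=["python", "/home/ferry/pydata/bokeh/webserver.py",
             "/home/ferry/pydata/data"],
)
print(dockerfile_text)
```

Writing `dockerfile_text` to a file named `Dockerfile` would give `docker build -t ferry/pydata .` something to build from.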
  11. 11. Sharing made simple
        $ docker pull ferry/pydata
        $ docker run -p 8000:8000 --name p1 -d ferry/pydata
        [Diagram: container p1 on top of Kernel and Hardware]
  12. 12. Sharing made simple
        $ docker pull ferry/pydata
        $ docker run -p 8000:8000 --name p1 -d ferry/pydata
        $ docker run -p 8001:8000 --name p2 -d ferry/pydata
        $ docker run -p 8002:8000 --name p3 -d ferry/pydata
        [Diagram: containers p1, p2, p3 on top of Kernel and Hardware]
        • Containers share basic kernel and H.W. capabilities
        • No virtualization
        • Containers are isolated
        • Access via port forwarding
        You can run these commands now!
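The port-forwarding pattern on slide 12 (each container listens on 8000 internally and is mapped to a different host port) can be sketched as follows. The helper names are hypothetical, and the sketch uses the `--name` spelling accepted by current Docker releases; it only builds the argument lists and does not start anything.

```python
def docker_run_args(image, name, host_port, container_port=8000):
    """Argument list for one detached container with a host:container port mapping."""
    mapping = "{}:{}".format(host_port, container_port)
    return ["docker", "run", "-p", mapping, "--name", name, "-d", image]


def plan_containers(image, count, first_host_port=8000):
    """Name containers p1..pN and give each one the next free host port."""
    return [
        docker_run_args(image, "p{}".format(index + 1), first_host_port + index)
        for index in range(count)
    ]


commands = plan_containers("ferry/pydata", 3)
for argv in commands:
    print(" ".join(argv))
```

Each list could be handed to `subprocess.check_call` on a machine with Docker installed; the three web servers would then answer on host ports 8000, 8001, and 8002.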
  13. 13. • Highly scalable and fault-tolerant
        • Great for storing streaming data (sensors, messages)
        CREATE KEYSPACE census WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
        USE census;
        CREATE TABLE acs_economic_data (
            state_cd TEXT,
            state_name TEXT,
            county_cd TEXT,
            county_name TEXT,
            median INT,
            mean INT,
            capita INT,
            PRIMARY KEY(county_cd, state_cd)
        );
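Loading rows into the `acs_economic_data` table from slide 13 might look like the sketch below. It assumes a session object with an `execute(statement, parameters)` method, which is how the DataStax `cassandra-driver` session is called with `%s` placeholders; the helper names and the `RecordingSession` stand-in are hypothetical, and the income figures are made up. The column list uses `county_cd`, as declared in the table.

```python
INSERT_CQL = (
    "INSERT INTO acs_economic_data "
    "(state_cd, state_name, county_cd, county_name, median, mean, capita) "
    "VALUES (%s, %s, %s, %s, %s, %s, %s)"
)


def to_params(record):
    """Order one record's fields to match the column list in INSERT_CQL."""
    return (
        record["state_cd"],
        record["state_name"],
        record["county_cd"],
        record["county_name"],
        int(record["median"]),
        int(record["mean"]),
        int(record["capita"]),
    )


def insert_records(session, records):
    """Insert each record through the given session; returns the row count."""
    for record in records:
        session.execute(INSERT_CQL, to_params(record))
    return len(records)


class RecordingSession:
    """Stand-in for a real driver session so the sketch runs without Cassandra."""

    def __init__(self):
        self.calls = []

    def execute(self, statement, parameters):
        self.calls.append((statement, parameters))


# Made-up figures for illustration only.
sample_records = [
    {"state_cd": "06", "state_name": "California", "county_cd": "001",
     "county_name": "Alameda", "median": "50000", "mean": "60000",
     "capita": "30000"},
]
session = RecordingSession()
inserted = insert_records(session, sample_records)
print(inserted, session.calls[0][1])
```

With the real driver, the stand-in would be replaced by a session connected to the `census` keyspace.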
  14. 14. Orchestration Web DB Web + DB • Simple • Full control • More work for you • Simpler Dockerfile • More extensible • How to orchestrate?
  15. 15. • Specify the containers that constitute your application in YAML • Support for Hadoop, Cassandra, GlusterFS, and OpenMPI • It’s a little bit like pip for your Docker-based runtime environment Ferry http://ferry.opencore.io
  16. 16. Our Application
        backend:
           - storage:
                personality: "cassandra"
                instances: 1
        connectors:
           - personality: "ferry/pydata-cassandra"
             ports: ["8000:8000"]

        # The cassandra-client base comes with the various drivers
        # pre-installed.
        FROM ferry/cassandra-client
        NAME ferry/pydata-cassandra
        # Place the start scripts in the events directories so they
        # are started when the connector is brought up.
        ADD ./scripts/startcas.sh /service/runscripts/start/
        ADD ./scripts/restartcas.sh /service/runscripts/restart/
        RUN chmod a+x /service/runscripts/start/startcas.sh
        RUN chmod a+x /service/runscripts/restart/restartcas.sh
  17. 17. Easy to share (again)
        $ ferry start cassandra.yml
        sa-df8d0aa6
        $ ferry ps
        UUID         Storage      Compute  Connectors   Status   Base           Time
        ----         -------      -------  ----------   ------   ----           ----
        sa-df8d0aa6  se-54ed4e93           se-a5350a8d  running  cassandra.yml
        $ ferry ssh sa-df8d0aa6
        root@client-se-a5350a8d:~# ps -eaf | grep python
        root 144 1 0 19:49 ? 00:00:00 python /home/ferry/pydata/bokeh/webserver.py /home/ferry/pydata/data
  18. 18. What’s it doing?
        $ ferry start cassandra.yml
        [Diagram: Web, C*, C*; label: Generate Config]
        root@client-se-a5350a8d:~# env | grep BACK
        BACKEND_STORAGE_TYPE=cassandra
        BACKEND_STORAGE_IP=
  19. 19. What’s it doing?
        $ ferry start yarn
        [Diagram: Client, Y, Y, G, G]
        root@client-se-b597cb21:~# env | grep BACK
        BACKEND_STORAGE_TYPE=gluster
        BACKEND_STORAGE_IP=
        BACKEND_COMPUTE_TYPE=yarn
        BACKEND_COMPUTE_IP=
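Slides 18 and 19 show that a connector learns about its backend through `BACKEND_*` environment variables. A client application could read them along these lines. This is a minimal sketch: `read_backend` is a hypothetical helper, only the four variable names shown on the slides are used, and the IP addresses in the example are placeholders because the slides leave those values blank.

```python
import os


def read_backend(environ=os.environ):
    """Collect the storage and compute backends advertised in the environment."""
    backend = {}
    for kind in ("STORAGE", "COMPUTE"):
        backend_type = environ.get("BACKEND_{}_TYPE".format(kind))
        if backend_type is None:
            # This kind of backend was not provisioned for the application.
            continue
        backend[kind.lower()] = {
            "type": backend_type,
            "ip": environ.get("BACKEND_{}_IP".format(kind), ""),
        }
    return backend


# Placeholder environment mirroring the variable names on the YARN slide.
example_env = {
    "BACKEND_STORAGE_TYPE": "gluster",
    "BACKEND_STORAGE_IP": "10.1.0.2",
    "BACKEND_COMPUTE_TYPE": "yarn",
    "BACKEND_COMPUTE_IP": "10.1.0.3",
}
backend = read_backend(example_env)
print(backend)
```

Called with no argument inside a connector, the same function would read the live process environment instead of the example mapping.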
  20. 20. What’s it doing? $ ferry stop sa-c6cbb572 Client Y Y G G
  21. 21. Next steps $ ferry share sa-df8d0aa6 w c* c* Hardware w c* c* Hardware w c* c* Hardware
  22. 22. Next steps $ ferry deploy sa-df8d0aa6 w c* c* Hardware w c* c* Hardware Hardware Hardware VPC EC2 S3
  23. 23. • Even simple applications can be complicated to install and run • Docker helps quite a bit with this • Ferry helps build out big data applications
  24. 24. Thank you!
        James: jlh@opencore.io
        Ferry: ferry.opencore.io, @open_core_io