Pydata2014

Ferry - Share & Deploy Big
Data Applications with Docker
James Horey

• Writing a simple application with Bokeh
• Packaging our application with Docker
• Orchestrating our application with Ferry
Technical material can be found at:
https://github.com/jhorey/pydata

U.S. Census
http://api.census.gov/data/2011/acs5?get=DP03_0062E&for=county:*&in=state:06
Median income All counties California

Let’s install Bokeh
$ pip install bokeh
>> Downloading/unpacking bokeh
>> SystemError: Cannot compile 'Python.h'. Perhaps you need
to install python-dev|python-devel.
$ apt-get install python-dev & pip install bokeh
>> "gcc: error trying to exec 'cc1plus': execvp: No such file
or directory
$ apt-get install g++
$ pip install bokeh
RuntimeError: bokeh sample data directory does not exist, please
execute bokeh.sampledata.download()
$ python
>>> import bokeh.sampledata

A simple application
$ python plot.py Kentucky
Louisville

Let’s share
#!/bin/bash
!
# Make sure we have ‘pip’ installed
apt-get install python-pip
!
# Install packages in right order
apt-get —-yes install g++ python-dev
pip install bokeh
!
# Now download the data
python geography.py data/
python population economic Kentucky
data/
!
# Start the web server
python webserver data/
• Your script didn’t work
• Oh, I was supposed to run this as
sudo?
• Ok, it still didn’t work
• I get this funny error
• Oh yeah, I’m running Redhat
• Ok I’m at my desk, just use my
computer

• Encapsulates applications in isolated containers
• Makes it easy and safe to distribute applications
• Easy to get started

Our Dockerﬁle
Start from a
clean Precise
image
Install stuff
Add our ﬁles
Run this when
starting
$ docker build -t ferry/pydata .
$ docker push ferry/pydata

Sharing made simple
$ docker pull ferry/pydata
$ docker run -p 8000:8000 -name p1 —d ferry/pydata
p1
Kernel
Hardware

Sharing made simple
$ docker pull ferry/pydata
p1 p2 p3
Kernel
Hardware
• Containers share basic kernel
and H.W. capabilities
• No virtualization
• Containers are isolated
• Access via port forwarding
You can run these commands now!

• Highly scalable and fault-tolerant
• Great for storing streaming data (sensors,
messages)
CREATE KEYSPACE census WITH REPLICATION = {
'class' : 'SimpleStrategy',
'replication_factor' : 1 };
!
USE census;
!
CREATE TABLE acs_economic_data (
state_cd TEXT,
state_name TEXT,
county_cd TEXT,
county_name TEXT,
median INT,
mean INT,
capita INT,
PRIMARY KEY(count_cd, state_cd)
);

Orchestration
Web DB
Web + DB
• Simple
• Full control
• More work for you
• Simpler Dockerﬁle
• More extensible
• How to orchestrate?

• Specify the containers that constitute your
application in YAML
• Support for Hadoop, Cassandra, GlusterFS, and
OpenMPI
• It’s a little bit like pip for your Docker-based
runtime environment
Ferry
http://ferry.opencore.io

Our Application
backend:
- storage:
personality: "cassandra"
instances: 1
connectors:
- personality: "ferry/pydata-cassandra"
ports: ["8000:8000"]
# The cassandra-client base comes with the various drivers
# pre-installed.
FROM ferry/cassandra-client
NAME ferry/pydata-cassandra
!
# Place the start scripts in the events directories so they
# are started when the connector is brought up.
ADD ./scripts/startcas.sh /service/runscripts/start/
ADD ./scripts/restartcas.sh /service/runscripts/restart/
RUN chmod a+x /service/runscripts/start/startcas.sh
RUN chmod a+x /service/runscripts/restart/restartcas.sh
+

Easy to share (again)
$ ferry start cassandra.yml
sa-df8d0aa6
$ ferry ps
UUID Storage Compute Connectors Status Base Time
---- ------- ------- ---------- ------ ---- ----
sa-df8d0aa6 se-54ed4e93 se-a5350a8d running cassandra.yml
$ ferry ssh sa-df8d0aa6
root@client-se-a5350a8d:~# ps -eaf | grep python
root 144 1 0 19:49 ? 00:00:00 python /home/ferry/
pydata/bokeh/webserver.py /home/ferry/pydata/data

What’s it doing?
$ ferry start cassandra.yml
Web C* C*
root@client-se-a5350a8d:~# env | grep BACK
BACKEND_STORAGE_TYPE=cassandra
BACKEND_STORAGE_IP=10.1.0.12
Generate!
Conﬁg

What’s it doing?
$ ferry start yarn
Client
Y Y
root@client-se-b597cb21:~# env | grep BACK
BACKEND_STORAGE_TYPE=gluster
BACKEND_STORAGE_IP=10.1.0.18
BACKEND_COMPUTE_TYPE=yarn
BACKEND_COMPUTE_IP=10.1.0.15
G G

What’s it doing?
$ ferry stop sa-c6cbb572
Client
Y Y
G G

Next steps
$ ferry share sa-df8d0aa6
w c* c*
Hardware
w c* c*
Hardware
w c* c*
Hardware

Next steps
$ ferry deploy sa-df8d0aa6
w c* c*
Hardware
w
c* c*
Hardware
Hardware Hardware
VPCEC2
S3

• Even simple applications can be complicated to
install and run
• Docker helps quite a bit with this
• Ferry helps build out big data applications

Thank you!
!
James
jlh@opencore.io
!
Ferry
ferry.opencore.io
@open_core_io

Pydata2014

Recommended

Recommended

More Related Content

What's hot

What's hot (18)

Similar to Pydata2014

Similar to Pydata2014 (20)

Recently uploaded

Recently uploaded (20)

Pydata2014