Analyzing Data With Docker
Andreas Dewes (@japh44)
EuroPython 2016 - Bilbao
Outline
Data Analysis: Small & Large-Scale, Easy & Difficult
Introduction To Docker
Containerizing our Data Analysis
Possible Approaches
Relevant Technologies & Outlook
Data Analysis: Use Cases
Small-scale, interactive: UI-based analysis (e.g. IPython notebook)
Small-scale, automated: analysis scripts using local data sources (e.g. databases)
Large-scale, automated: non-interactive analysis pipelines (e.g. Apache Hadoop)
Large-scale, interactive: "Big Data" tools, e.g. Apache Spark or Google BigQuery
So what's so difficult about data analysis?
Sharing Data & Tools
Reproducibility
Scaling
Enter Docker...
What is Docker?
A tool that allows us to deploy applications inside "software containers".
Containers work at the process level and isolate the view of the operating system (i.e. the processes, resources and files an application sees).
Provides a high-level API to manage, version-control, deploy and network containers.
Docker Core Concepts
[Diagram: Docker Engine, Docker API, CLI, Registry, Docker Swarm, images, containers]
Images Are Space-Efficient
(or at least more efficient than VMs)
Containers Have Little Overhead
https://domino.research.ibm.com/library/cyberdig.nsf/papers/0929052195DD819C85257D2300681E7B/$File/rc25482.pdf
Containers Are Self-Sufficient
Containers Are "Lego" For Data Analytics!
Container
output
inputs
configuration
data
networked containers
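The "Lego" picture above (inputs, configuration and data going in, output coming out) can be sketched as a small helper that assembles a `docker run` command line. This is only an illustration: the image name, the mount points `/data/in` and `/data/out`, and the function name are assumptions, not something taken from the talk. The command is built but not executed.

```python
from typing import Dict, List


def docker_run_command(
    image: str,
    input_dir: str,
    output_dir: str,
    config: Dict[str, str],
) -> List[str]:
    """Build (but do not execute) a `docker run` command for one analysis step."""
    command = ["docker", "run", "--rm"]
    # Inputs are mounted read-only so the step cannot modify them.
    command += ["-v", f"{input_dir}:/data/in:ro"]
    # Output goes to a separate, writable mount.
    command += ["-v", f"{output_dir}:/data/out"]
    # Configuration is passed as environment variables, in a stable order.
    for key in sorted(config):
        command += ["-e", f"{key}={config[key]}"]
    # The image name comes last; "example/analysis:1.0" below is hypothetical.
    command.append(image)
    return command


if __name__ == "__main__":
    print(" ".join(docker_run_command(
        "example/analysis:1.0", "/srv/logs", "/srv/results", {"WORKERS": "4"}
    )))
```

Keeping inputs read-only and outputs on their own mount is one way to make each building block safe to re-run and easy to chain with others.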
We Can Build Reproducible Data-Analysis Workflows With Them
[Diagram: Map Apache logs, Map Nginx logs, Filtering, Aggregate results, BI, Monitoring, Archiving]
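A minimal sketch of the "map" and "aggregate" steps named in the workflow above, assuming both servers write the widely used common/combined access-log format, where the status code follows the quoted request. The function names and the choice of counting status codes are illustrative assumptions; in a containerized setup each function would run in its own container.

```python
import re
from collections import Counter
from typing import Iterable

# In common/combined log lines the status code follows the quoted request.
STATUS_PATTERN = re.compile(r'"[^"]*" (\d{3}) ')


def map_status_codes(lines: Iterable[str]) -> Counter:
    """Map step: count HTTP status codes in one log source."""
    counts: Counter = Counter()
    for line in lines:
        match = STATUS_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def aggregate(partials: Iterable[Counter]) -> Counter:
    """Aggregate step: merge the partial counts of all map steps."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    apache = ['10.0.0.1 - - [20/Jul/2016:10:00:00 +0200] "GET / HTTP/1.1" 200 512']
    nginx = ['10.0.0.2 - - [20/Jul/2016:10:00:01 +0200] "GET /x HTTP/1.1" 404 153 "-" "curl"']
    print(aggregate([map_status_codes(apache), map_status_codes(nginx)]))
```

Because the map steps share one output type, further sources can be added without touching the aggregation.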
Example: Analyzing GitHub Data
[Diagram: log files from GitHub, analysis script, analysis process(es), output]
Repository with code: https://github.com/adewes/docker-map-reduce-example
Live Demo (fingers crossed)
Containerizing Our Analysis
[Diagram: log files from GitHub, analysis script, image, supervisor, three analysis containers, output]
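One way the supervisor's job could look, sketched under assumptions: it splits the log files round-robin across a fixed number of analysis containers and prepares one `docker run` command per container. The image name, the `/data` mount point, and the convention that the image's entrypoint accepts file names as arguments are all hypothetical; the example repository may do this differently.

```python
from typing import List, Sequence


def partition(files: Sequence[str], workers: int) -> List[List[str]]:
    """Distribute files round-robin over at most `workers` non-empty chunks."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    chunks: List[List[str]] = [[] for _ in range(workers)]
    for index, name in enumerate(files):
        chunks[index % workers].append(name)
    # Drop empty chunks so no container is started without work.
    return [chunk for chunk in chunks if chunk]


def container_commands(
    image: str, data_dir: str, files: Sequence[str], workers: int
) -> List[List[str]]:
    """Build one `docker run` command per chunk (commands are not executed)."""
    return [
        # The data directory is mounted read-only at the assumed path /data.
        ["docker", "run", "--rm", "-v", f"{data_dir}:/data:ro", image, *chunk]
        for chunk in partition(files, workers)
    ]


if __name__ == "__main__":
    logs = [f"/data/2016-07-20-{hour}.json.gz" for hour in range(5)]
    for command in container_commands("example/analysis:1.0", "/srv/logs", logs, 3):
        print(" ".join(command))
```

A real supervisor would also start the commands, collect the outputs and handle failed containers; those parts are left out here.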
Live demo (what could go wrong?)
Advantages
Easy to share
Each analysis step is self-sufficient
Analysis components are "plug & play"
Easy to parallelize (for the right problems)
Versioning included
Disadvantages
Requires preparing containers
Requires Docker on each machine
Slightly decreases interactivity & flexibility
Which Parts Are Missing?
Orchestration
Dependency Management
Resource Management
Rouster:
A Python Tool for Containerized Data Analysis
Built on top of the Docker API
"Make for Docker"
Resource Management
Container Orchestration
Dependency Management
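To illustrate what "Make for Docker" could mean in practice, here is a sketch of the classic make rule: a target has to be rebuilt if it is missing or older than any of its dependencies. This is not Rouster's code; the function name and the use of file modification times are assumptions chosen for the illustration.

```python
import os
from typing import Iterable


def is_stale(target: str, dependencies: Iterable[str]) -> bool:
    """Return True if `target` is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    # A dependency modified after the target means the target is out of date.
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)
```

An analysis step would then only be re-run in its container when `is_stale` reports that its output is out of date.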
Rouster Uses Recipes to Describe Data Analysis Workflows
Resources (including dependencies): versioning, dependency calculation, backup / copying, distribution, ...
Services: startup (including dependencies), resource provisioning, networking, ...
Actions: scheduling, monitoring, exception handling, logging, ...
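Starting resources, services and actions "including dependencies" amounts to ordering them so that each item comes after everything it depends on. A minimal sketch using the standard library's `graphlib` (Python 3.9+) follows. The step names and the dictionary layout are hypothetical and do not reflect Rouster's actual recipe format.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return the steps ordered so that dependencies always come first.

    `dependencies` maps each step to the set of steps it depends on.
    Raises graphlib.CycleError if the dependencies form a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    # Hypothetical workflow: load a CSV file into a Postgres service, then report.
    recipe = {
        "csv_file": set(),
        "postgres": set(),
        "import_csv": {"csv_file", "postgres"},
        "report": {"import_csv"},
    }
    print(execution_order(recipe))
```

Steps without a dependency between them (here `csv_file` and `postgres`) may appear in either order, which is also what makes them candidates for running in parallel.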
Live Demo: CSV -> Postgres
Open Questions
How to handle communication between containers
(through files, network, ...)?
How to provide resources/data to containers in a
distributed environment?
Other Relevant Technologies
Pachyderm
Pachyderm is a data lake that offers complete version control for data and leverages the container ecosystem to provide reproducible data processing. Built on top of Kubernetes.
http://www.pachyderm.io
Luigi
Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
https://github.com/spotify/luigi
Summary & Outlook
Containers are here to stay!
They are useful in various data analysis contexts.
They don't solve all our problems though.
We need additional tools to use them effectively.
Thanks!
Want to contribute?
https://github.com/7scientists/rouster
Andreas Dewes (@japh44)
Image Licenses:
https://commons.wikimedia.org/wiki/File:Matryoshka_dolls_(3671820040)_(2).jpg
https://pixabay.com/de/nordlichter-lager-zelt-abenteuer-1203289/
https://en.wikipedia.org/wiki/Orchestra
https://de.wikipedia.org/wiki/Graph_(Graphentheorie)
http://www.library.illinois.edu/prescons/disaster_response/high_density_storage_disaster_plan/
https://brookeborel.com/2011/06/02/363/
https://en.wikipedia.org/wiki/Data_sharing
