2. Motivation
• Software development and data science have distinct
lifecycles.
• Repeatability is fundamental to both.
• Bridging the data science lifecycle into the software
development lifecycle presents challenges.
3. Multi-tenancy with Multiplicity
• No tool really does it all. (Sorry.)
• Data wrangling, ETL/ELT, different algorithms hosted in different
compute frameworks, …
• Data pipeline or workflow to tie it all together.
• Everyone wants something different, sometimes for good reasons.
Being able to run a large number of different workloads for a large
number of different users is a win.
4. Containers
• Package apps with their libraries in a (relatively) clean manner
— especially important for native code.
• Ensure traceability of code, presuming that there is a solid CI
and repository solution in place.
5. Kubernetes is awesome.
For reasons you already know:
• Bin packing.
• Horizontal scale-out for the platform, auto-scaling for pods.
• Service discovery, load balancing.
• Self-healing.
• Batch execution.
And more reasons in the future:
• GPU affinity.
• Backplane for Spark.
6. A Simple Idea
What if we could package workloads in containers, and then kubectl
could be our fundamental devops primitive…?
Napkin Sketch:
1. Build a control plane that knows how to stamp out workloads via
a Provisioning API.
2. Profit.
[Diagram: a Control Plane on Kubernetes exposes a Provisioning API and
stamps out Workload1, Workload2, and Workload3.]
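A minimal sketch of what "stamping out" a workload could look like, assuming the Provisioning API ultimately renders a Kubernetes manifest that is handed to kubectl. The helper name `render_workload`, the `tenant` and `workload` labels, the one-namespace-per-tenant convention, and the image name are all illustrative assumptions, not part of the original design.

```python
import json


def render_workload(tenant: str, workload: str, image: str, replicas: int = 1) -> dict:
    """Build a Deployment manifest for one tenant workload (hypothetical helper)."""
    # Assumption: tenant and workload labels are how later tooling attributes pods.
    labels = {"tenant": tenant, "workload": workload}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        # Assumption: one namespace per tenant.
        "metadata": {"name": workload, "namespace": tenant, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the pod template labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": workload, "image": image}]},
            },
        },
    }


if __name__ == "__main__":
    # Placeholder tenant, workload, and image names.
    manifest = render_workload("team-a", "etl-nightly", "registry.example.com/etl:1.0")
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be piped to `kubectl apply -f -`, which keeps kubectl as the only primitive the control plane depends on.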
7. Challenges
• Typical workloads consist of multiple types of containers that need
to collaborate.
• Containerization (often) isn’t that bad, depending on your taste.
• Many workloads or components thereof (e.g., Spark) aren’t designed
in a manner that permits the best use of Kubernetes facilities.
Surgery (or holding your nose) is frequently required, but sometimes
(e.g., TensorFlow!) things work well from the start.
8. Example
Problem:
• ZooKeeper.
• Nodes have distinct identities, and the client protocol is designed
to defy load balancing.
Solution:
• One replication controller per node, and call it a day.
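The solution above can be sketched as follows: one ReplicationController per ZooKeeper node, each pinned to a single replica so that every node keeps its own identity. This is only an illustration under assumptions. The function name, the label keys, and the `ZK_NODE_ID` environment variable are hypothetical, and the real variable depends on the container image in use.

```python
def zookeeper_controllers(ensemble_size: int, image: str) -> list:
    """Return one ReplicationController manifest per ZooKeeper node."""
    controllers = []
    for node_id in range(1, ensemble_size + 1):
        # Distinct labels per node, so each controller owns exactly one pod.
        labels = {"app": "zookeeper", "node": str(node_id)}
        controllers.append({
            "apiVersion": "v1",
            "kind": "ReplicationController",
            "metadata": {"name": f"zookeeper-{node_id}", "labels": labels},
            "spec": {
                "replicas": 1,  # one pod per controller, restarted if it dies
                "selector": labels,
                "template": {
                    "metadata": {"labels": labels},
                    "spec": {"containers": [{
                        "name": "zookeeper",
                        "image": image,
                        # Hypothetical variable carrying the node identity.
                        "env": [{"name": "ZK_NODE_ID", "value": str(node_id)}],
                    }]},
                },
            },
        })
    return controllers
```

Calling `zookeeper_controllers(3, image)` yields three manifests named `zookeeper-1` through `zookeeper-3`. Clients would still need a stable address per node, which this sketch does not cover.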
9. Some Familiar Problems
Once you can stamp out workloads, you get down to familiar problems:
• Tenant-attributed logging (workload and user) and metrics.
• “Billing” and metering.
• Visibility and other flavors of operability.
• Security — from purposeful or accidental attackers.
• Workload isolation, e.g., for PII.
Fixing these problems frequently requires surgery. None of them is
unique to containerization or cluster scheduling of workloads; you
have to solve them anyway.
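As a rough illustration of tenant-attributed metering, the sketch below sums usage per tenant and workload. It assumes every usage sample carries the labels attached to the workload when it was provisioned. The sample shape, the `cpu_seconds` field, and the label keys are hypothetical, not an existing metrics format.

```python
from collections import defaultdict


def meter_by_tenant(samples):
    """Sum CPU-seconds per (tenant, workload) from labeled usage samples."""
    totals = defaultdict(float)
    for sample in samples:
        labels = sample.get("labels", {})
        # Samples without labels are kept visible rather than silently dropped.
        key = (labels.get("tenant", "unattributed"),
               labels.get("workload", "unknown"))
        totals[key] += sample["cpu_seconds"]
    return dict(totals)
```

Routing unlabeled usage to an "unattributed" bucket is a deliberate choice here: gaps in labeling then show up in the bill instead of disappearing.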
10. Wrap Up
• Building a data processing platform on Kubernetes has some
obvious starting points and some familiar challenges.
• More data scientists and middleware makers are starting with
containers as a packaging scheme.