https://go.dok.community/slack
https://dok.community/
ABSTRACT OF THE TALK
Complex computational workloads in Python are a common sight these days, especially in the context of processing large and complex datasets. Battle-hardened modules such as NumPy, Pandas, and Scikit-Learn handle the low-level tasks, while tools like Dask make it easy to parallelize these workloads across distributed computational environments. Meanwhile, Argo Workflows offers a Kubernetes-native solution for provisioning cloud resources and triggering workflows on a regular schedule; being Kubernetes-native, it also meshes nicely with other Kubernetes tools. This talk combines these two worlds by showcasing a set-up in which Argo-managed workflows schedule and automatically scale out Dask-powered data pipelines in Python.
BIO
Former academic in the field of renewable energy simulation and energy systems analysis. Currently responsible for architecting and maintaining the cloud and data strategy at ACCURE Battery Intelligence.
KEY TAKE-AWAYS FROM THE TALK
Argo Workflows + Dask is a nice combination for data-processing pipelines. There are a few "gotchas" to be on the lookout for, but it is nevertheless a generally applicable and powerful combination.
https://github.com/sevberg
2. Severin Ryberg, 20/02/2022
INTRODUCTION
MOTIVATION
SETUP
DEMO
Q & A
Goals of this presentation:
o Understand why using Argo + Dask for automated data-pipeline scheduling made sense for us
o Provide a rough overview of our infrastructure set-up
o Describe a basic Argo Workflows scaling example
o Describe a basic Dask data pipeline example
o Showcase the set-up
3. INTRODUCTION
WHO AM I? WHAT DO I DO?
5. Current Status
Infrastructure Architect
o Start-up founded in Mid 2020
• Closing in on 50 employees (Hiring!)
o Battery Intelligence as a Service
• Basic Monitoring
• State of Health
• Safety alerting
• Operation optimization
o USP
• Spin-off from a renowned research group
• 100% software driven
• Born in the cloud
o Primary tools / languages
• Python
• AWS
• Kubernetes
o Sole infra developer for a while
• “Infra“ Team size now around 10
o Data Engineer
• Onboarding new customers, Data conditioning
o Developer
• Develop & maintain company-wide fundamental tools. Primarily in Python. (Hint! Dask)
o Devops Engineer
• Automate my job away using GitLab CI
o Cloud Engineer
• AWS: EKS, S3, IAM, Lambda, and all that‘s in between
o Kubernetes Engineer
• Schedule & scale computation pipelines reliably (Hint! Argo)
7. ACCURE's Requirements: Operational
o Stay in Python
• Aligns with developer team’s skillset
• Spark flips between Python and the JVM
o Need to scale generic black-box functions
• Goes beyond “simple” map/reduce
o Shared development experience in testing and production environments
• Same code for sequential, parallel, and distributed contexts
o Conduct both batch-processing and shared parallel-processing
o Promote self-service for data engineers
8. ACCURE's Requirements: Infrastructure and Security
o Ultra-low-latency parallelization
• Pod spin-up times greatly slow down workflow runs
• Need to set up pools of pods for highly parallelized operations
o Multi-tenant environment
• Dedicated namespace per customer
• Operators can access each namespace
• Workflow service account scoped to the namespace
o Cost efficient computations
• Elastic compute infrastructure should automatically scale up and down according to load
o Deployment automation and version controlling
o High data throughput
• Avoid database bottlenecks for data-in and data-out
o Secure access
• Only robots access production environment
• Customer-specific credentials allow access to own data
o Dependable scheduling
o Exporting of logs to ELK
o Archiving of workflow execution history
9. Why not use a service?
o Apache Airflow
• High learning curve for pipeline developers
• Poor Kubernetes support
- Note! Prior to Airflow 2.0
• Still need to maintain the Kubernetes cluster yourself
o Prefect
• First tried option
• Early-stage start-up going through its own growing pains
- Note! This was in Jan – March 2021
• A change in its cost model would have drastically changed our price point
o AWS Batch, AWS Glue, AWS Data Pipeline, etc…
• Batch was used while we had trouble with the other solutions; it is okay, but not very flexible
• We have a preference to stay cloud-agnostic as much as possible
16. Argo Workflows Overview
Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo Workflows is implemented as a Kubernetes CRD.
o Define workflows where each step in the workflow is a container.
o Model multi-step workflows as a sequence of tasks or capture the dependencies between tasks using a graph (DAG).
- Argo Workflows homepage
INTERMISSION
https://github.com/mcgrawia/argocon-21-demo
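To make the CRD concrete, here is a minimal sketch of a two-task DAG Workflow. This is illustrative only, not taken from the demo repo; template names, parameters, and the image are assumptions:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-dag-     # Argo appends a random suffix
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: a
            template: echo
            arguments:
              parameters: [{name: msg, value: "step A"}]
          - name: b
            template: echo
            dependencies: [a]  # b runs only after a completes
            arguments:
              parameters: [{name: msg, value: "step B, after A"}]
    - name: echo               # each step is just a container
      inputs:
        parameters: [{name: msg}]
      container:
        image: alpine:3.15
        command: [echo, "{{inputs.parameters.msg}}"]
```

Submitted with `argo submit`, each task runs as its own pod, and the `dependencies` field is what turns the list of tasks into a DAG.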
23. Dask Overview
o Dask: high-throughput data pipelines in Python
o Comes out of the box with
• Multi-domain execution
• Dask-dataframes & futures interface
o Supporting resources
• Dask development support from the community, and as a service
• Cluster-provisioning services (e.g. Coiled)
o ACCURE had to implement
• Work-avoidance
• Task artifacting (on S3)
• Logging to ELK
INTERMISSION II
https://docs.dask.org/en/latest/
https://docs.dask.org/en/latest/graphs.html
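As a small taste of the delayed/futures interface mentioned above, here is a self-contained sketch (illustrative only, run on Dask's local scheduler; in production the same code targets a distributed cluster by pointing a `dask.distributed.Client` at the scheduler's address):

```python
# Minimal Dask "delayed" sketch: build a lazy task graph, then execute it.
from dask import delayed

@delayed
def load(partition):
    # Stand-in for reading one data partition
    return list(range(partition * 3, partition * 3 + 3))

@delayed
def process(values):
    return sum(values)

@delayed
def combine(results):
    return sum(results)

# Nothing is computed yet -- these calls only build the graph
partials = [process(load(p)) for p in range(4)]
total = combine(partials)

# Execution happens here, parallelized by the scheduler
print(total.compute())  # → 66
```

The same graph runs unchanged in sequential, parallel, and distributed contexts, which is exactly the "shared development experience" requirement above.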
27. Stated Requirements:
o Python
o Black-box generic
o Shared development experience
o Batch- and shared parallel-processing
o Self-service
o Ultra-low latency
o Multi-tenant
o Cost efficient
o Secure
o Scheduling
o Logs
o Archiving
o Data throughput
o Automated and versioned
The stack
Kubernetes
K8S Autoscaler
Argo Workflows
ELK
Prometheus
Postgresql
S3
Pulumi
Gitlab
Dask
Coiled
ACCURE Utilities
ACCURE Data Pipelines
33. Combining Dask and Argo Workflows
Pipeline configuration (Workflow Template)
Standard Dask scaler (Cluster Workflow Template)
Initiate workflow
Dask Scheduler (daemon)
Dask Worker Deployment (daemon)
Worker tear-down (on exit)
Primary pipeline
Image repo, pod resources (workers, scheduler, pipeline), pipeline settings
Image tag, worker count, pipeline arguments
"Product-specific Dask image", based on python:3.8.8-buster:
● Pip-install dask, distributed, bokeh, and whatever else you need
● Activate environment (e.g. conda)
● Ensure "dask-scheduler" and "dask-worker" are on path
● Inject Python script
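The architecture above can be sketched as a single Argo Workflow, heavily abbreviated and hedged: the template names, image, and paths below are illustrative, not the actual demo manifests. The key idea is that the scheduler and workers run as `daemon` steps, so the primary pipeline step can reach them while they stay alive:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dask-pipeline-
spec:
  entrypoint: main
  templates:
    - name: main
      steps:
        # daemon: scheduler stays alive for the rest of the workflow
        - - name: scheduler
            template: dask-scheduler
        # daemon: worker(s) connect to the scheduler's pod IP
        - - name: worker
            template: dask-worker
            arguments:
              parameters:
                - {name: scheduler, value: "{{steps.scheduler.ip}}"}
        # the primary pipeline submits tasks to the scheduler
        - - name: pipeline
            template: run-pipeline
            arguments:
              parameters:
                - {name: scheduler, value: "{{steps.scheduler.ip}}"}
    - name: dask-scheduler
      daemon: true
      container:
        image: my-dask-image:latest       # illustrative image name
        command: [dask-scheduler]
    - name: dask-worker
      daemon: true
      inputs:
        parameters: [{name: scheduler}]
      container:
        image: my-dask-image:latest
        command: [dask-worker, "tcp://{{inputs.parameters.scheduler}}:8786"]
    - name: run-pipeline
      inputs:
        parameters: [{name: scheduler}]
      container:
        image: my-dask-image:latest
        command: [python, /app/pipeline.py,
                  "--scheduler", "tcp://{{inputs.parameters.scheduler}}:8786"]
```

Daemon pods are terminated automatically when the workflow completes; the "worker tear-down (on exit)" step in the diagram would additionally be expressed as an `onExit` handler to scale the worker deployment back down.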
34. Example Dask Pipeline - Today’s example
o Timeseries weather data in Spain
• Madrid, Valencia, Barcelona,
Seville, Bilbao
o Which is the windiest city!?
[Pipeline DAG: four parallel "Get all timestamps" tasks (on Dask workers) → "Identify windiest city at timestamp" → "Count windiest city observations" → "Report" (in the pipeline); run-time shown as "Computed in…"]
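The shape of this pipeline can be sketched with Dask futures. This is a hedged stand-in for the actual demo code: function names, the record layout, and the sample numbers below are all illustrative:

```python
# Fan out one comparison per timestamp across Dask workers,
# then reduce the results to a per-city count.
from collections import Counter
from dask.distributed import Client

def windiest_at(timestamp, wind_speeds):
    # wind_speeds: {city: wind speed} for one timestamp
    return max(wind_speeds, key=wind_speeds.get)

def run(client, records):
    # records: {timestamp: {city: wind speed}}
    futures = [
        client.submit(windiest_at, ts, speeds)
        for ts, speeds in records.items()
    ]
    counts = Counter(client.gather(futures))
    return counts.most_common(1)[0][0]

if __name__ == "__main__":
    # A local in-process cluster stands in for the Argo-provisioned one
    client = Client(processes=False)
    fake_records = {
        "2020-01-01T00:00": {"Madrid": 3.1, "Bilbao": 7.2, "Valencia": 4.0},
        "2020-01-01T01:00": {"Madrid": 2.5, "Bilbao": 6.8, "Valencia": 5.1},
        "2020-01-01T02:00": {"Madrid": 8.9, "Bilbao": 4.0, "Valencia": 3.3},
    }
    print(run(client, fake_records))  # → Bilbao
    client.close()
```

In the distributed set-up, only the `Client(...)` line changes: it is pointed at the scheduler address injected by the Argo workflow.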
35. DEMO
THE PROOF IS IN THE PUDDING
36. Git Repo
o Code is available at: https://github.com/pipekit/argocon-demos
• Created with the help of the Pipekit team! (https://pipekit.io/)
o Contains:
• Sample Argo workflows installation
- Workload is too heavy for Docker Desktop, Minikube, etc. - use cloud k8s
• Cluster workflow template containing Dask pipeline
• Workflow template that invokes the Dask pipeline
• Dockerized Python pipeline that schedules tasks on Dask
• Sample weather data for major cities in Spain
37. Stated Requirements:
o Python
o Black-box generic
o Shared development experience
o Batch- and shared parallel-processing
o Self-service
o Ultra-low latency
o Multi-tenant
o Cost efficient
o Secure
o Scheduling
o Logs
o Archiving
o Data throughput
o Automated and versioned
The demo stack
Kubernetes
Argo
Workflows
Dask
Simple Data Pipeline
38. Kubernetes Cluster Internals - Demo
o Any “substantial” Kubernetes cluster should work
• Locally-hosted K3S, minikube, or Docker Desktop don’t work
o Cluster wide Argo-Workflows installation
• Provision a namespace for each customer
• Operators can create, update, and delete Argo Workflows and Workflow Templates in those namespaces
o Customer-specific namespaces
• Workflow templates differentiated by namespace, configured for a specific customer context (i.e. IAM secret)
• Cron workflows trigger workflow templates
• Customer-specific credentials and other secrets isolated to the namespace
[Cluster diagram: at cluster scope, the Argo server, Argo controller, controller config-map, Cluster Workflow Template, and ClusterRole; inside each customer namespace, a Workflow Template, CRON Workflow Template, customer-specific config, and RoleBinding]
40. Closing remarks
o When to use Argo Workflows & Dask vs when not to:
1. Scheduled vs. "ad-hoc" computing
- Scheduled / automated workflows? Argo + Dask is a great fit!
- Humans in the loop? Then a Dask-as-a-service may be better (e.g. Coiled):
  Provides fast development iteration
  Empowers self-service for data engineers
2. Dask is not a silver bullet
- Some instabilities for long-running tasks (hours, yes - days, no)
- Can do batch processing, but not the main focus
- Dask is an active project. Community involvement / interest can help improve this!
o Alternatives (no significant personal experience)
• Couler – Develop Argo workflows directly in python
• Ray – Low level parallelization engine in Python