https://go.dok.community/slack
https://dok.community/
ABSTRACT OF THE TALK
Complex computational workloads in Python are a common sight these days, especially in the context of processing large and complex datasets. Battle-hardened modules such as NumPy, Pandas, and Scikit-Learn handle the low-level tasks, while tools like Dask make it easy to parallelize these workloads across distributed computational environments. Meanwhile, Argo Workflows offers a Kubernetes-native solution for provisioning cloud resources and triggering workflows on a regular schedule; being Kubernetes-native, it also meshes nicely with other Kubernetes tools. This talk combines these two worlds by showcasing a set-up in which Argo-managed workflows schedule and automatically scale out Dask-powered data pipelines in Python.
BIO
Former academic in the field of renewable energy simulation and energy systems analysis. Currently responsible for architecting and maintaining the cloud and data strategy at ACCURE Battery Intelligence.
KEY TAKE-AWAYS FROM THE TALK
Argo Workflows + Dask is a nice combination for data-processing pipelines. There are a few "gotchas" to be on the lookout for, but it is nevertheless a generally applicable and powerful combination.
https://github.com/sevberg
2. Severin Ryberg, 20/02/2022
INTRODUCTION
MOTIVATION
SETUP
DEMO
Q & A
Goals of this presentation:
o Understand why using Argo + Dask for automated data-pipeline scheduling made sense for us
o Provide a rough overview of our infrastructure set-up
o Describe a basic Argo Workflows scaling example
o Describe a basic Dask data pipeline example
o Showcase the set-up
3. INTRODUCTION
WHO AM I? WHAT DO I DO?
5. Current Status
Infrastructure Architect
o Start-up founded in Mid 2020
• Closing in on 50 employees (Hiring!)
o Battery Intelligence as a Service
• Basic Monitoring
• State of Health
• Safety alerting
• Operation optimization
o USP
• Spin-off from a renowned research group
• 100% software driven
• Born in the cloud
o Primary tools / languages
• Python
• AWS
• Kubernetes
o Sole infra developer for a while
• “Infra“ Team size now around 10
o Data Engineer
• Onboarding new customers, Data conditioning
o Developer
• Develop & maintain company-wide fundamental tools. Primarily in Python. (Hint! Dask)
o Devops Engineer
• Automate my job away using GitLab CI
o Cloud Engineer
• AWS: EKS, S3, IAM, Lambda, and all that‘s in between
o Kubernetes Engineer
• Schedule & scale computation pipelines reliably (Hint! Argo)
7. ACCURE's Requirements: Operational
o Stay in Python
• Aligns with developer team’s skillset
• Spark flips between Python and the JVM
o Need to scale generic black-box functions
• Goes beyond “simple” map/reduce
o Shared development experience in testing and production environments
• Same code for sequential, parallel, and distributed contexts
o Conduct both batch-processing and shared parallel-processing
o Promote self-service for data engineers
8. ACCURE's Requirements: Infrastructure and Security
o Ultra-low-latency parallelization
• Pod spin-up times greatly slow down workflow runs
• Need to set up pools of pods for highly parallelized operations
o Multi-tenant environment
• Dedicated namespace per customer
• Operators can access each namespace
• Workflow service account scoped to the namespace
o Cost efficient computations
• Elastic compute infrastructure should automatically scale up and down according to load
o Deployment automation and version controlling
o High data throughput
• Avoid database bottlenecks for data-in and data-out
o Secure access
• Only robots access production environment
• Customer-specific credentials allow access to own data
o Dependable scheduling
o Exporting of logs to ELK
o Archiving of workflow execution history
9. Why not use a service?
o Apache Airflow
• High learning curve for pipeline developers
• Poor Kubernetes support
- Note! Prior to Airflow 2.0
• Still need to maintain the Kubernetes cluster yourself
o Prefect
• First tried option
• Early-stage start-up going through its own growing pains
- Note! This was in Jan – March 2021
• A change in its cost model would have drastically changed our price point
o AWS Batch, AWS Glue, AWS Data Pipeline, etc…
• Batch was used while we had trouble with the other solutions; it is okay, but not very flexible
• We have a preference to stay cloud-agnostic as much as possible
16. Argo Workflows Overview
Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo Workflows is implemented as a Kubernetes CRD.
o Define workflows where each step in the workflow is a container.
o Model multi-step workflows as a sequence of tasks or capture the dependencies between tasks using a graph (DAG).
- Argo Workflows homepage
INTERMISSION
https://github.com/mcgrawia/argocon-21-demo
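To make the CRD concrete, here is a minimal sketch of a two-task DAG Workflow. This is illustrative only, not taken from the demo repo; template names, parameters, and the image are assumptions:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-dag-     # Argo appends a random suffix
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: a
            template: echo
            arguments:
              parameters: [{name: msg, value: "step A"}]
          - name: b
            template: echo
            dependencies: [a]  # b runs only after a completes
            arguments:
              parameters: [{name: msg, value: "step B, after A"}]
    - name: echo               # each step is just a container
      inputs:
        parameters: [{name: msg}]
      container:
        image: alpine:3.15
        command: [echo, "{{inputs.parameters.msg}}"]
```

Submitted with `argo submit`, each task runs as its own pod, and the `dependencies` field is what turns the list of tasks into a DAG.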
23. Dask Overview
o Dask: high-throughput data pipelines in Python
o Comes out of the box with
• Multi-domain execution
• Dask-dataframes & futures interface
o Supporting resources
• Dask development support from the community, and as a service
• Cluster-provisioning services (e.g. Coiled)
o ACCURE had to implement
• Work-avoidance
• Task artifacting (on S3)
• Logging to ELK
INTERMISSION II
https://docs.dask.org/en/latest/
https://docs.dask.org/en/latest/graphs.html
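As a small taste of the delayed/futures interface mentioned above, here is a self-contained sketch (illustrative only, run on Dask's local scheduler; in production the same code targets a distributed cluster by pointing a `dask.distributed.Client` at the scheduler's address):

```python
# Minimal Dask "delayed" sketch: build a lazy task graph, then execute it.
from dask import delayed

@delayed
def load(partition):
    # Stand-in for reading one data partition
    return list(range(partition * 3, partition * 3 + 3))

@delayed
def process(values):
    return sum(values)

@delayed
def combine(results):
    return sum(results)

# Nothing is computed yet -- these calls only build the graph
partials = [process(load(p)) for p in range(4)]
total = combine(partials)

# Execution happens here, parallelized by the scheduler
print(total.compute())  # → 66
```

The same graph runs unchanged in sequential, parallel, and distributed contexts, which is exactly the "shared development experience" requirement above.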
27. Stated Requirements:
o Python
o Black-box generic
o Shared development experience
o Batch- and shared parallel-processing
o Self-service
o Ultra-low latency
o Multi-tenant
o Cost efficient
o Secure
o Scheduling
o Logs
o Archiving
o Data throughput
o Automated and versioned
The stack
Kubernetes
K8S Autoscaler
Argo Workflows
ELK
Prometheus
Postgresql
S3
Pulumi
Gitlab
Dask
Coiled
ACCURE Utilities
ACCURE Data Pipelines
33. Combining Dask and Argo Workflows
Pipeline configuration (Workflow Template)
Standard Dask scaler (Cluster Workflow Template)
Initiate workflow
Dask Scheduler (daemon)
Dask Worker Deployment (daemon)
Worker tear-down (on exit)
Primary pipeline
Image repo, pod resources (workers, scheduler, pipeline), pipeline settings
Image tag, worker count, pipeline arguments
"Product-specific Dask image", based on python:3.8.8-buster:
● Pip-install dask, distributed, bokeh, and whatever else you need
● Activate environment (e.g. conda)
● Ensure "dask-scheduler" and "dask-worker" are on path
● Inject Python script
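The architecture above can be sketched as a single Argo Workflow, heavily abbreviated and hedged: the template names, image, and paths below are illustrative, not the actual demo manifests. The key idea is that the scheduler and workers run as `daemon` steps, so the primary pipeline step can reach them while they stay alive:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dask-pipeline-
spec:
  entrypoint: main
  templates:
    - name: main
      steps:
        # daemon: scheduler stays alive for the rest of the workflow
        - - name: scheduler
            template: dask-scheduler
        # daemon: worker(s) connect to the scheduler's pod IP
        - - name: worker
            template: dask-worker
            arguments:
              parameters:
                - {name: scheduler, value: "{{steps.scheduler.ip}}"}
        # the primary pipeline submits tasks to the scheduler
        - - name: pipeline
            template: run-pipeline
            arguments:
              parameters:
                - {name: scheduler, value: "{{steps.scheduler.ip}}"}
    - name: dask-scheduler
      daemon: true
      container:
        image: my-dask-image:latest       # illustrative image name
        command: [dask-scheduler]
    - name: dask-worker
      daemon: true
      inputs:
        parameters: [{name: scheduler}]
      container:
        image: my-dask-image:latest
        command: [dask-worker, "tcp://{{inputs.parameters.scheduler}}:8786"]
    - name: run-pipeline
      inputs:
        parameters: [{name: scheduler}]
      container:
        image: my-dask-image:latest
        command: [python, /app/pipeline.py,
                  "--scheduler", "tcp://{{inputs.parameters.scheduler}}:8786"]
```

Daemon pods are terminated automatically when the workflow completes; the "worker tear-down (on exit)" step in the diagram would additionally be expressed as an `onExit` handler to scale the worker deployment back down.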
34. Example Dask Pipeline - Today’s example
o Timeseries weather data in Spain
• Madrid, Valencia, Barcelona,
Seville, Bilbao
o Which is the windiest city!?
[Pipeline DAG: four parallel "Get all timestamps" tasks (on Dask workers) → "Identify windiest city at timestamp" → "Count windiest city observations" → "Report" (in the pipeline); run-time shown as "Computed in…"]
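The shape of this pipeline can be sketched with Dask futures. This is a hedged stand-in for the actual demo code: function names, the record layout, and the sample numbers below are all illustrative:

```python
# Fan out one comparison per timestamp across Dask workers,
# then reduce the results to a per-city count.
from collections import Counter
from dask.distributed import Client

def windiest_at(timestamp, wind_speeds):
    # wind_speeds: {city: wind speed} for one timestamp
    return max(wind_speeds, key=wind_speeds.get)

def run(client, records):
    # records: {timestamp: {city: wind speed}}
    futures = [
        client.submit(windiest_at, ts, speeds)
        for ts, speeds in records.items()
    ]
    counts = Counter(client.gather(futures))
    return counts.most_common(1)[0][0]

if __name__ == "__main__":
    # A local in-process cluster stands in for the Argo-provisioned one
    client = Client(processes=False)
    fake_records = {
        "2020-01-01T00:00": {"Madrid": 3.1, "Bilbao": 7.2, "Valencia": 4.0},
        "2020-01-01T01:00": {"Madrid": 2.5, "Bilbao": 6.8, "Valencia": 5.1},
        "2020-01-01T02:00": {"Madrid": 8.9, "Bilbao": 4.0, "Valencia": 3.3},
    }
    print(run(client, fake_records))  # → Bilbao
    client.close()
```

In the distributed set-up, only the `Client(...)` line changes: it is pointed at the scheduler address injected by the Argo workflow.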
35. DEMO
THE PROOF IS IN THE PUDDING
36. Git Repo
o Code is available at: https://github.com/pipekit/argocon-demos
• Created with the help of the Pipekit team! (https://pipekit.io/)
o Contains:
• Sample Argo workflows installation
- Workload is too heavy for Docker Desktop, Minikube, etc. - use cloud k8s
• Cluster workflow template containing Dask pipeline
• Workflow template that invokes the Dask pipeline
• Dockerized Python pipeline that schedules tasks on Dask
• Sample weather data for major cities in Spain
37. Stated Requirements:
o Python
o Black-box generic
o Shared development experience
o Batch- and shared parallel-processing
o Self-service
o Ultra-low latency
o Multi-tenant
o Cost efficient
o Secure
o Scheduling
o Logs
o Archiving
o Data throughput
o Automated and versioned
The demo stack
Kubernetes
Argo
Workflows
Dask
Simple Data Pipeline
38. Kubernetes Cluster Internals - Demo
o Any “substantial” Kubernetes cluster should work
• Locally-hosted K3S, minikube, or Docker Desktop don’t work
o Cluster wide Argo-Workflows installation
• Provision a namespace for each customer
• Operators can create, update, and delete Argo Workflows and Workflow Templates in those namespaces
o Customer-specific namespaces
• Workflow templates differentiated by namespace, configured for a specific customer context (i.e. IAM secret)
• Cron workflows trigger workflow templates
• Customer-specific credentials and other secrets isolated to the namespace
[Cluster diagram: at cluster scope, the Argo server, Argo controller, controller config-map, Cluster Workflow Template, and ClusterRole; inside each customer namespace, a Workflow Template, CRON Workflow Template, customer-specific config, and RoleBinding]
40. Closing remarks
o When to use Argo Workflows & Dask vs when not to:
1. Scheduled vs. "ad-hoc" computing
- Scheduled / automated workflows? Argo + Dask is a great fit!
- Humans in the loop? Then a Dask-as-a-service may be better (e.g. Coiled):
  Provides fast development iteration
  Empowers self-service for data engineers
2. Dask is not a silver bullet
- Some instabilities for long-running tasks (hours, yes - days, no)
- Can do batch processing, but not the main focus
- Dask is an active project. Community involvement / interest can help improve this!
o Alternatives (no significant personal experience)
• Couler – Develop Argo workflows directly in python
• Ray – Low level parallelization engine in Python