Prefect + Dask: Parallel / Distributed Workflows
Chris White
February 26, 2020
Chris White Workflows on Dask February 26, 2020 1 / 6
Workflows
What do we mean by “Workflows”?
Broadly speaking, when we talk about “workflow semantics” we mean:
“Tasks” represent units of business logic
Identification of failure (alerting)
Recovery from failure
Triggering logic (e.g., some tasks should be triggered by failed jobs)
Retrying tasks is a first-class operation
Run-once guarantees
Audit trails, lineage, access controls, an API
Scheduling features for both batch and ad-hoc runs
... and many more
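
Two of these semantics, retries as a first-class operation and triggering a task when something upstream has failed, can be sketched in plain Python. This is an illustrative toy under stated assumptions, not Prefect's API: `run_with_retries`, `any_failed`, the state dictionaries, and the task functions are all hypothetical names introduced here.

```python
from typing import Callable, Dict, List

def run_with_retries(fn: Callable[[], object], max_retries: int) -> Dict[str, object]:
    """Run fn, retrying on any exception; return a small state record."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return {"state": "Success", "result": fn(), "attempts": attempts}
        except Exception as exc:
            # Give up once the initial attempt plus max_retries retries are spent.
            if attempts > max_retries:
                return {"state": "Failed", "result": exc, "attempts": attempts}

def any_failed(upstream: List[Dict[str, object]]) -> bool:
    """Trigger condition: fire when at least one upstream task failed."""
    return any(s["state"] == "Failed" for s in upstream)

calls = {"flaky": 0}

def flaky_extract() -> int:
    # Hypothetical task that fails twice before succeeding.
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise RuntimeError("transient error")
    return 42

def always_broken() -> int:
    # Hypothetical task that never succeeds.
    raise ValueError("bad input")

extract_state = run_with_retries(flaky_extract, max_retries=3)
load_state = run_with_retries(always_broken, max_retries=1)

# The alerting step runs only because an upstream task ended in a Failed state.
alerts: List[str] = []
if any_failed([extract_state, load_state]):
    alerts.append("load failed after %d attempts" % load_state["attempts"])
```

The point of the sketch is that success, failure, and retry counts are recorded as explicit states that downstream logic can inspect, rather than as exceptions that simply end the program.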
Workflows
Where does Dask come in?
Asynchronous scheduling of tasks
Parallelizing task execution
Distributing task execution
Submitting tasks to heterogeneous workers (worker resources)
Creating clusters on demand / per run (dask-kubernetes)
... all off the shelf
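
The first two items can be illustrated with futures. The sketch below uses the standard library's `concurrent.futures` as a stand-in so that it runs without a cluster; Dask's distributed `Client` offers a similar `submit` / `result` futures interface. The executor choice, the `extract` and `combine` functions, and the values are illustrative assumptions, not Prefect's executor code.

```python
from concurrent.futures import ThreadPoolExecutor

def extract(n: int) -> int:
    # Hypothetical unit of work.
    return n * 2

def combine(values) -> int:
    # Hypothetical downstream task that depends on all upstream results.
    return sum(values)

with ThreadPoolExecutor(max_workers=4) as pool:
    # submit() returns a future immediately; the work runs in the background,
    # so the five extract calls can proceed in parallel.
    futures = [pool.submit(extract, n) for n in range(5)]
    # result() blocks until each future has finished.
    results = [f.result() for f in futures]
    total = pool.submit(combine, results).result()
```

With a distributed scheduler the same submit-then-gather shape applies, except that the functions and their arguments are shipped to remote workers instead of local threads.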
Workflows
What Dask cares about
Workflows
What Prefect cares about
Some fun problems
Complications Opportunities
Approximately half of our community is not familiar with Dask
Dask is more willing to rerun tasks
Sharing futures between Clients would be great
Prefect currently submits large payloads to the scheduler
Creating Dask-aware objects is hard
Resource configuration is an art, not a science
Prefect abdicates process control to Dask
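
The tension between a scheduler that may rerun tasks and a workflow system that wants run-once guarantees can be made concrete with a small guard. This is a minimal sketch under loud assumptions: `run_once`, the `completed` dictionary, and the key format are hypothetical, and the in-memory dictionary stands in for a durable, shared store that a real distributed setup would need. It is not a description of how Prefect or Dask handle this.

```python
from typing import Callable, Dict, List

completed: Dict[str, object] = {}   # stand-in for a durable record of finished task runs
side_effects: List[str] = []        # lets us observe how often the task body really ran

def run_once(key: str, fn: Callable[[], object]) -> object:
    """Run fn at most once per key; later calls return the recorded result."""
    if key in completed:
        return completed[key]
    result = fn()
    completed[key] = result
    return result

def send_invoice() -> str:
    # Hypothetical task with a side effect that must not be repeated.
    side_effects.append("invoice sent")
    return "ok"

first = run_once("send_invoice-2020-02-26", send_invoice)
second = run_once("send_invoice-2020-02-26", send_invoice)  # simulated rerun by the scheduler
```

The check-then-write here is not atomic, so two workers racing on the same key could still both run the task; closing that gap requires a transactional store or a lock, which is part of why this is listed as a problem rather than a solved detail.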