TaskFlow and OpenStack

Joshua Harlow
Yahoo!

The problem statement

‣ Statemanagementachitis
  ‣ Workflows without state management in place are hard to follow, alter and recover
‣ Unreliable workflow and resource state
  ‣ Distributed system correctness is a hard problem
  ‣ RPC boundaries are a constant balance between improving scalability and decreasing consistency
  ‣ Race conditions occur more often than desired

Continued…

‣ Manager/Driver API boundary
‣ Organic growth of features and patches
‣ Application and state recovery is typically patched on after the fact instead of built in from the ground up (e.g., periodic tasks)
‣ The ability to `service stop` an application cleanly, without manual (or periodic) cleanup, is crucial for features like live upgrades

Why this matters???

‣ Customers expect stability and consistency
  ‣ API and service reliability
  ‣ Resource and/or state corruption (or people to manually fix these problems) costs $$$
  ‣ Easily understood workflows and states allow the development and alteration of existing workflows
‣ Upgrades (not even live), just upgrades
  ‣ Just say no to destroying the cloud to upgrade

‣ Pride: we can build a system that does
  ‣ We all want OpenStack cloud software to be very reliable, and if and when it does fail it should not cause unrecoverable corruption
‣ Be the exception to the norm!

Introducing: TaskFlow

A library for OpenStack that makes task execution easy, consistent, and reliable.

What it is

‣ A StackForge & PyPI library
‣ Developed by and for the community
  ‣ Yahoo!, Grid Dynamics, Rackspace, AT&T, NTT …
‣ Community driven & well documented
  ‣ https://pypi.python.org/pypi/taskflow
  ‣ https://wiki.openstack.org/wiki/TaskFlow
‣ A paradigm and lightweight framework

What it is not

‣ A webservice with a REST API
  ‣ See sessions about Mistral
‣ A solution to all problems
  ‣ Does not solve world peace
  ‣ Will not deliver your rainbow ponies
  ‣ Still requires understanding and careful coding

Foundational concepts

‣ Code structure (your application's frame)
‣ Controlled execution
  ‣ Who & what manages the overall execution
‣ Persistence (how you know what was executed)
‣ Work recovery
  ‣ How you recover from failure/partial progress

‣ Tasks
  ‣ Executes and reverts one action
  ‣ Receives inputs and declares outputs
‣ Flows
  ‣ Composes tasks (or subflows) into useful structures
  ‣ Imposes some definition of order onto the running of your tasks or subflows
    ‣ Linear order, unordered, topological order…

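The task and flow ideas above can be sketched in plain Python. This is an illustration of the concept only, not TaskFlow's API: the `Task` base class, the `CreateVolume` and `AttachVolume` tasks, the `store` dictionary and `run_linear` are all hypothetical names chosen for this sketch. It assumes a linear flow where a failure reverts everything that was started, in reverse order.

```python
class Task:
    """One unit of work that knows how to undo itself (hypothetical base class)."""

    def execute(self, store):
        raise NotImplementedError

    def revert(self, store):
        pass


class CreateVolume(Task):
    def execute(self, store):
        # Input: store["size"]; declared output: store["volume"].
        store["volume"] = "vol-%d" % store["size"]

    def revert(self, store):
        store.pop("volume", None)


class AttachVolume(Task):
    def execute(self, store):
        # Input: store["volume"]; declared output: store["attached"].
        if store.get("fail_attach"):
            raise RuntimeError("attach failed")
        store["attached"] = store["volume"]

    def revert(self, store):
        store.pop("attached", None)


def run_linear(tasks, store):
    """Run tasks in order; on failure revert every started task in reverse."""
    started = []
    try:
        for task in tasks:
            started.append(task)
            task.execute(store)
    except Exception:
        for task in reversed(started):
            task.revert(store)
        return False
    return True
```

With `store = {"size": 10}` the flow leaves both outputs in the store; adding `"fail_attach": True` makes the second task fail, and the created volume is removed again by `CreateVolume.revert`.
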
Engines

‣ Runs your flow (and associated tasks) in a well-defined, reliable, consistent and resumable manner
  ‣ Follows well-defined state transitions
‣ Allows deployers/developers of a service that uses taskflow to select the engine that suits their setup best
‣ Backed by varying implementations
  ‣ Single-threaded
  ‣ Multi-threaded via native or green threads
  ‣ Distributed (WIP)

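A rough sketch of the engine idea, assuming an unordered set of independent work units: the same units can be handed to a serial or a thread-based engine, and each unit only moves through an allowed set of state transitions. The names `Unit`, `serial_engine`, `threaded_engine`, `run` and the state names are assumptions made for this sketch, not TaskFlow's engine API.

```python
from concurrent.futures import ThreadPoolExecutor

PENDING, RUNNING, SUCCESS, FAILURE = "PENDING", "RUNNING", "SUCCESS", "FAILURE"
# The only transitions this sketch permits.
ALLOWED = {(PENDING, RUNNING), (RUNNING, SUCCESS), (RUNNING, FAILURE)}


class Unit:
    """A callable plus its state, result and transition history."""

    def __init__(self, name, func):
        self.name = name
        self.func = func
        self.state = PENDING
        self.result = None
        self.history = [PENDING]

    def _move(self, new_state):
        if (self.state, new_state) not in ALLOWED:
            raise ValueError("%s: %s -> %s not allowed"
                             % (self.name, self.state, new_state))
        self.state = new_state
        self.history.append(new_state)

    def run(self):
        self._move(RUNNING)
        try:
            self.result = self.func()
        except Exception as exc:
            self.result = exc
            self._move(FAILURE)
        else:
            self._move(SUCCESS)
        return self


def serial_engine(units):
    for unit in units:
        unit.run()
    return units


def threaded_engine(units, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(Unit.run, units))  # wait for every unit to finish
    return units


ENGINES = {"serial": serial_engine, "threaded": threaded_engine}


def run(units, engine="serial"):
    """The deployer picks the engine by name; the units do not change."""
    return ENGINES[engine](units)
```

The point of the design is that the choice of engine is a deployment decision: the units are written once and `run(units, engine="threaded")` changes only how they are scheduled.
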
Persistence

‣ Saves task state/progress/results and flow state
‣ Allows for reconstruction and resumption of flows and associated tasks
‣ Allows the user to view the play-by-play action history of flows and associated tasks
  ‣ Facilitates debugging of taskflow usage and integration
‣ Backed by varying implementations
  ‣ File system, memory, database…

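A minimal sketch of why persisted state enables resumption, assuming a file-system backend that stores one JSON "logbook" per flow. The helper names (`load_logbook`, `save_logbook`, `run_resumable`) and the record layout are hypothetical and are not TaskFlow's persistence API.

```python
import json
import os


def load_logbook(path):
    """Return the saved per-step records, or an empty logbook."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}


def save_logbook(path, logbook):
    """Write to a temp file, then rename, so a crash cannot leave a torn file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(logbook, f)
    os.replace(tmp, path)


def run_resumable(steps, path):
    """Run (name, func) steps in order, skipping steps already saved as SUCCESS.

    Each func receives a dict of earlier results and must return
    something JSON-serializable.
    """
    logbook = load_logbook(path)
    executed = []
    for name, func in steps:
        entry = logbook.get(name)
        if entry and entry["state"] == "SUCCESS":
            continue  # finished in an earlier run
        results = {n: e["result"] for n, e in logbook.items()
                   if e["state"] == "SUCCESS"}
        logbook[name] = {"state": "RUNNING", "result": None}
        save_logbook(path, logbook)
        logbook[name] = {"state": "SUCCESS", "result": func(results)}
        save_logbook(path, logbook)
        executed.append(name)
    return logbook, executed
```

If the process is interrupted during a step, the file still shows that step as `RUNNING`; calling `run_resumable` again with the same path re-runs only that step and the ones after it. The same file doubles as a history of what ran, which helps when debugging.
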
‣ Jobs
  ‣ The initial (and any derivative) set of tasks & flows required to fulfill an action
  ‣ Can be transferred to a worker for completion
  ‣ Can be re-associated on worker failure (or timeout) for resumption or undo/reversion
‣ Job board
  ‣ A system where jobs can be atomically posted, reposted, claimed, marked as completed…
  ‣ Backed by varying implementations
    ‣ Message queue, zookeeper, database…

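The job board semantics above can be illustrated with an in-memory sketch, assuming a single process where a `threading.Lock` stands in for the atomicity a message queue, ZooKeeper or database would provide. The `JobBoard` class and its method names are hypothetical and are not a TaskFlow interface.

```python
import threading


class JobBoard:
    """In-memory board: post, atomically claim, abandon, complete."""

    def __init__(self):
        self._lock = threading.Lock()
        self._jobs = {}  # name -> {"details", "owner", "done"}

    def post(self, name, details):
        with self._lock:
            if name in self._jobs:
                raise ValueError("job already posted: %s" % name)
            self._jobs[name] = {"details": details, "owner": None, "done": False}

    def claim(self, worker):
        """Give the first unowned, unfinished job to exactly one worker."""
        with self._lock:
            for name, job in self._jobs.items():
                if job["owner"] is None and not job["done"]:
                    job["owner"] = worker
                    return name
        return None

    def abandon(self, name, worker):
        """Release a claim (worker failure or timeout) so another worker can resume."""
        with self._lock:
            job = self._jobs[name]
            if job["owner"] == worker and not job["done"]:
                job["owner"] = None

    def complete(self, name, worker):
        with self._lock:
            job = self._jobs[name]
            if job["owner"] != worker:
                raise ValueError("%s does not own %s" % (worker, name))
            job["done"] = True
```

Because `claim` checks and sets the owner under one lock, two workers can never both believe they own the same job; `abandon` is what lets a job be re-associated with a different worker after a failure.
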
What exists

‣ Release 0.1
  ‣ Tagged & uploaded to PyPI on October 24, 2013
  ‣ Contains foundational concepts
    ‣ Tasks, flows, resumption, persistence, local engines
‣ Excellent documentation!
  ‣ Best practices, inputs & outputs, examples, state transitions …
  ‣ Design – engines, persistence, primitives …

What’s missing

‣ Distributed engine
‣ Lock service
  ‣ Ensure your resources are not simultaneously trampled
‣ Zookeeper storage layer
‣ Job and job board reference implementation
  ‣ Currently being built out

Examples

‣ Miniature nova
  ‣ Try to control-c and then restart
‣ Miniature cinder
  ‣ Try to control-c and then restart
‣ Parallel volume creation
  ‣ Runs in parallel!
‣ More …

Havana development

‣ Cinder
  ‣ Create volume workflow uses taskflow!!
  ‣ Continued integration: sessions [1]
‣ Billingstack
  ‣ Being used in payment methods
‣ TaskFlow (itself)
  ‣ Only started in early May of 2013

Planned integration

‣ Nova
  ‣ Under discussion
  ‣ Sessions [1, 2]
‣ Glance
  ‣ Under discussion
  ‣ Sessions [1]
‣ Trove, heat, quantum, rally, your project (?)

Mistral (recently announced!)

‣ The Mistral service provides a convenient API, based on a simple generic DSL, for executing any task flows
  ‣ Targeted at providing various scheduling and orchestration capabilities for generic computational tasks
‣ Use cases
  ‣ Cloud cron, deployment & configuration management, analytics & reporting…
‣ Implements the convection proposal

Get involved!

‣ Developers wanted!
  ‣ Want to help taskflow get integrated quicker?
    ‣ More reliable, consistent openstack == better!
  ‣ Want to help build taskflow 0.2+
‣ Features wanted!
  ‣ Have a neat feature to implement?
  ‣ Have a neat use-case that currently is not satisfied?

‣ Weekly meetings
  ‣ http://wiki.openstack.org/wiki/Meetings#State_management_team_meeting
‣ Launchpad: http://launchpad.net/taskflow
‣ BPs: http://blueprints.launchpad.net/taskflow
‣ Code: http://github.com/stackforge/taskflow
‣ Wiki/docs: http://wiki.openstack.org/TaskFlow
‣ IRC at #openstack-state-management

?? Questions ??
