7 WAYS TO EXECUTE
SCHEDULED JOBS
WITH PYTHON
GUEST POST: TIM MUGAYI
JOB SCHEDULING
Job scheduling is a common programming challenge that most organizations and developers must tackle at some point to solve critical problems, and one made all the more pressing by the proliferation of big data and machine learning model training. Having the ability to crunch petabytes of data on a predictable, automated basis to derive insights is a key factor in setting you apart from the competition.
There are various open-source solutions, such as Hadoop and Apache Spark, and proprietary vendor solutions, such as AWS Glue, for handling these large data sets. One key component required by all of these technologies is a job scheduler: something that can trigger events at predefined time intervals in a fault-tolerant, scalable way.
SCHEDULED
JOBS
ARE AUTOMATED PIECES OF WORK THAT CAN BE PERFORMED AT A SPECIFIC TIME OR ON A RECURRING BASIS, PREDOMINANTLY DEFINED WITH UNIX-STYLE EXPRESSIONS CALLED CRON. THESE TIME-BASED EVENT TRIGGERS ENABLE APPLICATIONS TO SCHEDULE WORK TO BE PERFORMED AT A CERTAIN DATE OR TIME.
APPLICATIONS
Many applications need to schedule routine tasks such as system maintenance, administration, taking a daily backup of data, or sending emails. If you code often, there will always be a need to run some event or task at a predefined time. A scheduled job can be synchronous or asynchronous and can span any arbitrary time frame, and the infrastructure it was scheduled on might be entirely different from the one it runs on.
PREREQUISITES
The objective of this slide deck is to outline the options at your disposal when crafting your next job scheduler in Python, so you can immediately start automating your Python and data science solutions.
To follow along, make sure you have Python ≥ 3.5 installed, with an Anaconda environment or a Python virtual environment configured, so you can run the sample code and see how the libraries work.
OPTIONS
1. APScheduler
2. CronTab
3. AWS Cron Jobs
4. Celery Periodic Tasks
5. Timeloop
6. Queue-Based Decoupled Scheduling
7. Apache Airflow
1. APScheduler
This is probably one of the easiest ways to add a cron-like scheduler to your web-based or standalone Python applications. The library is easy to get started with and offers multiple backends, also known as job stores, such as:
• Memory (host-machine in-memory scheduler)
• SQLAlchemy (any RDBMS supported by SQLAlchemy)
• MongoDB (NoSQL database)
• Redis (in-memory key-value data structure store)
• RethinkDB
• ZooKeeper
Backends provide a storage location where you can persist your triggers. For example, if you set your Python script to execute every day at 5 pm, you have created a trigger. If you shut down your program, or it terminates unexpectedly, the trigger data can be read back from the persistence store when your script resumes, and your script keeps firing on its configured schedule.
Trigger stores also make sense when you don't want to hard-code triggers or go through redeployment cycles: they let you change triggers dynamically through the backend, or let users change them via a user interface. Your choice of backend depends entirely on your stack.
APScheduler offers three basic scheduling systems that should meet most of your job scheduler needs:
• Cron-style scheduling (with optional start/end times)
• Interval-based execution (run jobs at even intervals, with optional start/end times)
• One-off delayed execution (run jobs once, at a set date/time)
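As a minimal sketch combining a persistent job store with the interval and cron trigger styles (the SQLite URL and job bodies are illustrative):

```python
from apscheduler.schedulers.blocking import BlockingScheduler
from apscheduler.jobstores.sqlalchemy import SQLAlchemyJobStore

def refresh_cache():
    print("runs every 10 minutes")

def daily_report():
    print("runs every day at 5 pm")

# Persist triggers in SQLite so schedules survive restarts (URL is illustrative).
sched = BlockingScheduler(
    jobstores={"default": SQLAlchemyJobStore(url="sqlite:///jobs.sqlite")}
)

# Interval-based execution: run on an even interval.
sched.add_job(refresh_cache, "interval", minutes=10,
              id="refresh_cache", replace_existing=True)

# Cron-style scheduling: run every day at 5 pm.
sched.add_job(daily_report, "cron", hour=17,
              id="daily_report", replace_existing=True)

# One-off delayed execution would use the "date" trigger with a run_date.
sched.start()
```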
2. CronTab
Cron is a utility that lets us schedule tasks on Unix-like systems using cron expressions. The tasks are defined in a crontab, a text file containing the commands to be executed. Cron uses a specific syntax to define schedules: five fields, separated by whitespace, giving the minute, hour, day of month, month, and day of week.
Python offers the python-crontab module (imported as crontab) for managing scheduled jobs via cron. Its functions let you access cron, create jobs, set restrictions, remove jobs, and more, without having to write crontab files by hand, which makes life considerably easier than maintaining crontabs manually. For details on everything you can do with it, read its API documentation.
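A short sketch of managing crontab entries from Python with python-crontab (the script path and comment tag are illustrative):

```python
from crontab import CronTab

# Open the current user's crontab.
cron = CronTab(user=True)

# Create a new job; the command and comment tag are illustrative.
job = cron.new(command="python3 /home/user/backup.py", comment="daily-backup")

# Five-field cron expression: minute hour day-of-month month day-of-week.
job.setall("0 17 * * *")  # every day at 5 pm

# Persist the change back to the crontab.
cron.write()

# Later, locate jobs by their comment tag and remove them.
for job in cron.find_comment("daily-backup"):
    cron.remove(job)
cron.write()
```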
3. AWS Cron Jobs
If AWS is your primary development environment and you're not concerned with vendor lock-in, you have several options for getting your Python code working in a cron-like fashion.
CloudWatch Events with Lambda
This is the conventional approach to creating cron triggers on AWS. It uses an AWS Lambda function triggered by a CloudWatch cron event. Since Lambda runs your Python code, you are bound by Lambda's limitations.
ECS Scheduled Tasks
Amazon Elastic Container Service (Amazon ECS) is a fully managed container orchestration service; to use this approach, you have to be comfortable using Docker. The idea is similar to the Lambda approach, except Lambda is only used to trigger the execution of an ECS task definition, which corresponds to a Docker image hosting your Python code. CloudWatch Events still serves as the cron trigger.
CloudWatch Events with Lambda and EC2
If you take CloudWatch event triggers and mash them up with Lambda and EC2 instances, you get this approach. It can make sense when your Python code needs to do more, perhaps a CPU-intensive task that requires more resources, but you don't need the benefits of serverless.
AWS Batch
This option suits more complicated tasks that exceed the limitations of serverless, or tasks that take hours or days to complete. AWS Batch manages the scheduling and provisioning of the work, and you can define multi-stage pipelines where each stage depends on the completion of the previous one. AWS Batch works off a queue and launches EC2 instances on a per-need basis, so you only pay for what you use and instance management is handled for you.
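For the first approach, a boto3 sketch of wiring a CloudWatch Events cron rule to an existing Lambda function might look like this (the function name, rule name, and ARN are hypothetical):

```python
import boto3

# Hypothetical ARN of a Lambda function containing your Python job.
LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:daily-report"

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Create (or update) a rule that fires every day at 5 pm UTC.
# AWS cron expressions have six fields: minute hour day month weekday year.
rule = events.put_rule(
    Name="daily-report-trigger",
    ScheduleExpression="cron(0 17 * * ? *)",
    State="ENABLED",
)

# Grant CloudWatch Events permission to invoke the function.
lambda_client.add_permission(
    FunctionName="daily-report",
    StatementId="daily-report-trigger-permission",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)

# Point the rule at the Lambda function.
events.put_targets(
    Rule="daily-report-trigger",
    Targets=[{"Id": "1", "Arn": LAMBDA_ARN}],
)
```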
4. Celery Periodic Tasks
Celery is a Python framework for distributed, asynchronous processing of tasks via message brokers such as RabbitMQ, SQS, and Redis. Celery is built around the producer-consumer FIFO queue design pattern. Though mainly used for such use cases, it has a built-in scheduler, named beat, that you can take advantage of.
Beat, as the name implies, is a scheduler: it places messages in a message broker queue whenever a predefined time interval is reached, defined either as a basic interval or as a complex cron expression. Once beat places messages in the broker, they become available for consumption by the next available Celery worker.
Note that jobs may overlap if a job does not complete before the next one is triggered. This is something to keep in mind whenever you craft a job scheduler; mechanisms such as semaphores and Redis locks can mitigate this behavior if it's not desired.
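A minimal beat configuration might look like this sketch (the broker URL and task names are illustrative):

```python
# tasks.py
from celery import Celery
from celery.schedules import crontab

app = Celery("tasks", broker="redis://localhost:6379/0")

app.conf.beat_schedule = {
    # Basic interval: place a message on the queue every 30 seconds.
    "cleanup-every-30s": {
        "task": "tasks.cleanup",
        "schedule": 30.0,
    },
    # Cron expression: every day at 5 pm.
    "report-daily-5pm": {
        "task": "tasks.daily_report",
        "schedule": crontab(hour=17, minute=0),
    },
}

@app.task
def cleanup():
    print("cleaning up temporary files")

@app.task
def daily_report():
    print("building the daily report")
```

Running `celery -A tasks worker --beat --loglevel=info` starts a worker with an embedded beat scheduler, which is convenient for development; in production, beat typically runs as its own process so that only one scheduler places messages.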
5. Timeloop
Timeloop is a simple library for running multiple periodic tasks: it uses a decorator pattern to run tagged functions in threads. If you are looking to take advantage of multiple cores, this is probably not the library to use, but it's sufficient for simple use cases where you don't need a full-blown framework and want something lightweight to incorporate into your web or standalone Python applications. To get started, install it via pip.
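A minimal sketch after `pip install timeloop` (the job bodies are illustrative):

```python
from datetime import timedelta

from timeloop import Timeloop

tl = Timeloop()

# Each tagged function runs in its own thread at the given interval.
@tl.job(interval=timedelta(seconds=30))
def poll_api():
    print("runs every 30 seconds")

@tl.job(interval=timedelta(hours=1))
def hourly_cleanup():
    print("runs every hour")

# block=True keeps the main thread alive; with block=False the scheduler
# runs in the background, e.g. alongside a web application.
tl.start(block=True)
```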
6. Queue-Based Decoupled Scheduling
If you need to add predictability and redundancy to your scheduled jobs, you can build a distributed job scheduler using AMQP, with RabbitMQ or any queue stack of your choosing. This gives you the ability to scale your job scheduler: the queue delegates work across consumers via an exchange, which determines which messages each queue should receive, depending on the exchange type and the queue parameters. An exchange is effectively a safe place to publish messages, decoupled from the producer; it takes each message and forwards it along to the appropriate consumers.
The core component in this design is the producer, which handles the cron events and publishes them to the exchange; scheduler workers simply bind to a shared queue. Implemented on an AMQP system such as RabbitMQ, the design is vendor-neutral and can be multi-cloud. Python has RabbitMQ clients, such as pika, that are easy to work with and get started.
The queue-based design is a clean way to decouple producers and consumers, and it addresses an issue with many traditional schedulers: the lack of a replay mechanism. Queue-based job schedulers are useful when you outgrow simple cron-based schedulers and need a little more out of your Python scheduled jobs, such as:
• Job schedules that handle complex relationships between jobs (e.g., one job triggering another), such as a state machine, or workflow-based scheduled jobs such as big data ETL scheduling.
• Complex retry mechanisms for failed scheduled jobs, with reporting and alerting.
A minimal sketch of the pattern follows the list.
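Here is a sketch of the producer and worker sides using pika, assuming a RabbitMQ broker on localhost (the exchange, queue, and job names are illustrative):

```python
import time

import pika

EXCHANGE = "scheduled-jobs"  # illustrative names
QUEUE = "job-queue"

def publish_job(body: str) -> None:
    """Producer side: called on each schedule tick, publishes a job message."""
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.exchange_declare(exchange=EXCHANGE, exchange_type="direct")
    channel.basic_publish(exchange=EXCHANGE, routing_key=QUEUE, body=body)
    connection.close()

def run_worker() -> None:
    """Consumer side: binds the shared queue and processes job messages."""
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.exchange_declare(exchange=EXCHANGE, exchange_type="direct")
    channel.queue_declare(queue=QUEUE, durable=True)
    channel.queue_bind(exchange=EXCHANGE, queue=QUEUE, routing_key=QUEUE)

    def on_message(ch, method, properties, body):
        print("running job:", body.decode())
        ch.basic_ack(delivery_tag=method.delivery_tag)  # ack enables safe replay

    channel.basic_consume(queue=QUEUE, on_message_callback=on_message)
    channel.start_consuming()

if __name__ == "__main__":
    # Stand-in for a real cron trigger: publish a job every 60 seconds.
    while True:
        publish_job("run-nightly-etl")
        time.sleep(60)
```

Multiple workers bound to the same queue receive messages round-robin, which is what gives the design its horizontal scalability.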
7. Apache Airflow
Airflow's Directed Acyclic Graphs (DAGs for short) offer a way to build and schedule complex, dynamic data workflows using Python. Airflow is best known for building workflows that tap into external resources, such as RDBMS databases or custom scripts, to perform ETL-related data transformation and cleanup, though that's not the only thing it's good for. With a few configurations in your Python code, you can build a pretty robust job scheduler that integrates easily with Dask and other machine learning frameworks. Its core concepts are DAGs, operators, and executors.
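A minimal DAG sketch in Airflow 2.x style (the dag_id, schedule, and callables are illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extracting data")

def transform():
    print("transforming data")

with DAG(
    dag_id="daily_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 17 * * *",  # standard cron: every day at 5 pm
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract_task >> transform_task  # transform runs only after extract succeeds
```

Operators define what each task does, the DAG defines the dependencies and schedule, and the executor determines where and how the tasks actually run.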
Stay up to date with Saturn Cloud on LinkedIn and Twitter.
You may also be interested in: Linear Models in Python.
With the tools and options presented in this article, managing your scheduled jobs does not have to be tedious. You can quickly start building job schedulers that automate most, if not all, of your data science pipelines, and build out ETL job schedulers that perform data extraction. They can also help you pipe extracted data into services such as Dask, offered by Saturn Cloud, which makes scaling your data science, deep learning, and machine learning models a lot easier.
Original blog post here.
THANK YOU!
SATURN CLOUD
33 IRVING PL
NEW YORK, NY 10003
SUPPORT@SATURNCLOUD.IO
(831) 228-8739
