7 ways to execute scheduled jobs with python
1. 7 WAYS TO EXECUTE
SCHEDULED JOBS
WITH PYTHON
GUEST POST: TIM MUGAYI
2. JOB SCHEDULING
Job scheduling is a common programming challenge that
most organizations and developers at some point must
tackle in order to solve critical problems. This is further
exacerbated by the proliferation of big data and training
models for machine learning.
Having the ability to crunch petabytes of data on a predictable
and automated basis in order to derive insights is a key
driving factor to set you apart from the competition.
There are various open-source solutions, such as Hadoop
and Apache Spark, and proprietary vendor solutions, such
as AWS Glue, used to handle these large data sets. One
key component required by all these technologies is a
"job scheduler" that gives the ability to trigger events
at predefined time intervals in a fault-tolerant and scalable
way.
SCHEDULED
JOBS
ARE AUTOMATED PIECES OF WORK THAT CAN BE PERFORMED AT A SPECIFIC TIME OR ON A RECURRING BASIS.
THEY PREDOMINANTLY USE UNIX-STYLE EXPRESSIONS CALLED CRON: TIME-BASED EVENT TRIGGERS THAT
ENABLE APPLICATIONS TO SCHEDULE WORK TO BE PERFORMED AT A CERTAIN DATE OR TIME.
3. APPLICATIONS
Many applications need to schedule routine tasks like system
maintenance, administration, taking a daily backup of data, or
sending emails. If you code often, there will always be a need to run
some event or task at a predefined time period. A scheduled job
can be synchronous or asynchronous, spanning any arbitrary time
frame. The infrastructure that it was scheduled on might be entirely
different from that on which it runs.
4. PREREQUISITES
The objective of this slide deck is to outline innovative
options that you have available at your disposal when
crafting your next job scheduler in Python. Then you
can immediately start automating your Python and data
science solutions.
In order to follow along, ensure you have Python ≥ 3.5
installed with an Anaconda environment or a Python virtual
environment configured, so you can run some of the
sample code to see how the libraries work.
5. APSCHEDULER (1)
This is probably one of the easiest ways you can add a
cron-like scheduler to your web-based or standalone
Python applications. This library is pretty easy to get
started with and offers multiple backends, also known as
job stores, such as:
• Memory (host machine in-memory scheduler)
• SQLAlchemy (any RDBMS supported by SQLAlchemy)
• MongoDB (NoSQL database)
• Redis (in-memory key-value pair data structure store)
• RethinkDB
• ZooKeeper
• ...
6. APSCHEDULER (1), continued
Backends provide a storage location where
you can persist your triggers. For example, if
you set your Python script to execute every
day at 5 pm, you have created a trigger.
If you shut down your program, or if your
program terminates unexpectedly, then upon
resuming your script this data can be read
from your persistence store, and your Python
script continues firing as per your
configured schedule.
Trigger stores also make sense in situations
where you do not wish to hard-code triggers
or go through redeployment cycles. They give
you the option to change triggers dynamically
through the backend, or to let users change
triggers via a user interface. Your choice of
backend depends entirely on your stack.
APScheduler offers three basic scheduling systems that
should meet most of your job scheduler needs:
• Cron-style scheduling (with optional start/end times)
• Interval-based execution (run jobs on even intervals, with optional start/end times)
• One-off delayed execution (run jobs once, on a set date/time)
7. CronTab (2)
Cron is a utility that allows us to schedule tasks on
Unix-based systems using cron expressions. The tasks
in cron are defined in a crontab, which is a text file
containing the commands to be executed. Cron uses a
specific syntax to define its time schedules: each entry
consists of five fields, separated by white space, that
together describe when the command should run.
Python offers the crontab module to manage scheduled
jobs via cron. The functions available in it allow you to
access cron, create jobs, set restrictions, remove jobs,
and more, without having to write crontab files yourself.
Using the Python interface makes life easier than creating
crontabs manually. For more details on everything you
can do with crontab, read up on the API documentation.
8. AWS Cron Jobs (3)
If AWS is your primary development environment and you're not concerned with vendor lock-in, you
have a couple of options at your disposal to get your Python code working in a cron-like fashion.

CloudWatch Events with Lambda
This option is the conventional approach on AWS to
create cron triggers. The approach leverages AWS
Lambda, which is triggered by a CloudWatch cron
event. Since Lambda is being leveraged here, your
Python code is bound to the limitations of Lambda.

ECS Scheduled Tasks
Amazon Elastic Container Service (Amazon ECS)
is a fully managed container orchestration service.
If you choose this approach, you have to be
comfortable using Docker. The idea of building ECS
scheduled tasks is similar to that of using Lambda,
but Lambda is only used to trigger the execution
of your ECS task definition, which corresponds
to a Docker image hosting your Python code.
CloudWatch events are still used as the cron trigger.

CloudWatch Events with Lambda and EC2
If you take the concept of CloudWatch event
triggers and combine it with Lambda and EC2
instances, you get this approach. When you need
your Python code to do more, perhaps some
CPU-intensive task that requires more resources
but doesn't need the benefits of serverless, this
approach might make sense.

AWS Batch
This option suits more complicated tasks that span
beyond the limitations of serverless, or tasks that
can take hours or days to complete. AWS Batch
manages the scheduling and provisioning of the
work. You can define multi-stage pipelines where
each stage depends on the completion of the
previous one. AWS Batch works on a queue and
initializes EC2 instances on a per-need basis.
The benefits of this approach are that you only
pay for what you use, and the managing of
instances is done for you.
9. Celery Periodic Tasks (4)
Celery is a Python framework that allows distributed processing
of tasks in an asynchronous fashion via message brokers such
as RabbitMQ, SQS, and Redis. Celery is built around the
consumer-producer FIFO queue design pattern. Though mainly
used for such use cases, it has a built-in scheduler, named
beat, that you can take advantage of.
Beat, as the name implies, is a scheduler that places
messages in a message broker queue when a predefined
time interval is reached, either via basic intervals or complex
cron expressions. Once beat places messages in the message
broker, they become available for consumption by the next
available Celery worker.
Something to take note of is that jobs may overlap if
one job does not complete before the next is triggered. This
is something to keep in mind whenever you craft your job
scheduler. Things like semaphores and Redis locks can be used
to mitigate this behavior if it's not desired.
10. Timeloop (5)
Timeloop is a library that can be used to run multiple periodic tasks. It is a simple library that
uses a decorator pattern for running tagged functions in threads. If you are looking to take
advantage of multiple cores, this might not be the library to use. It is sufficient for simple
use cases where you don't need a full-blown framework, or where you need something simple to
incorporate into your web or standalone Python applications. To get started with the library,
install it via pip.
11. Queue-Based Decoupled Scheduling (6)
If you have a need to add predictability
and redundancy to your scheduled jobs,
you can opt to build out a distributed job
scheduler by utilizing AMQP, RabbitMQ,
or any queue stack of your choosing.
This gives you the ability to scale your job
scheduler. The queue delegates work
across consumers via an exchange that
determines what kind of messages the
queue should get, depending on the
exchange type and some of the queue
parameters.
An exchange is effectively a safe place
to publish messages that is decoupled
from the producer and provides all needed
information about available consumers.
The exchange will take a message and
forward it along.
The core feature in this design is the
producer, which handles the cron events
and publishes them to an exchange.
Scheduled workers can simply bind to a
shared queue. This design is implemented
with an AMQP system such as RabbitMQ,
is vendor-neutral, and can be multi-cloud.
Python has RabbitMQ clients, such as pika,
that are easy to work with and get
started.
The queue-based job scheduler design
is a clean way to decouple producers
and consumers. It also solves one of
the issues with many traditional
schedulers: the lack of a replay mechanism.
Queue-based job schedulers are useful
in the event you outgrow simple cron-
based schedulers. There are instances
where you require a little bit more out
of your Python scheduled jobs, such as:
• Job schedules that need to handle
complex relationships between jobs
(e.g. one job triggers another job), such
as a state machine, or workflow-based
scheduled jobs such as big data ETL
scheduling.
• Complex retry mechanisms for failed
scheduled jobs, with reporting and alert
mechanisms.
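The producer/consumer split described above can be sketched in a single process with the stdlib queue module. This is illustrative only; in a real deployment the Queue would be a RabbitMQ exchange and queue (e.g. via pika), and the producer and consumers would be separate services:

```python
# Single-process sketch of queue-based decoupled scheduling.
import queue
import threading
import time

jobs: queue.Queue = queue.Queue()
results = []

def producer(n_events: int, interval: float) -> None:
    # Stands in for the cron trigger: publishes one message per tick.
    for i in range(n_events):
        jobs.put({"job_id": i, "fired_at": time.time()})
        time.sleep(interval)
    jobs.put(None)  # sentinel: no more events

def consumer() -> None:
    # Stands in for a scheduled worker bound to the shared queue.
    while True:
        msg = jobs.get()
        if msg is None:
            break
        results.append(msg["job_id"])
        jobs.task_done()

t_prod = threading.Thread(target=producer, args=(3, 0.01))
t_cons = threading.Thread(target=consumer)
t_prod.start(); t_cons.start()
t_prod.join(); t_cons.join()
print(results)  # the three fired job ids, in order
```

Because producer and consumer only share the queue, either side can be scaled or replaced independently, which is the decoupling the design is after.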
12. Apache Airflow (7)
Airflow's Directed Acyclic Graph, DAG for short, offers a
way to build and schedule complex and dynamic data
workflows using Python. Airflow is mostly known for
its ability to build out workflows that tap into external
resources, such as RDBMS databases or custom scripts,
to perform ETL-related data transformation and cleanup.
That's not the only thing it's good for, though. With a few
configurations in your Python code, you can build out a
pretty robust job scheduler that you can integrate easily
with Dask and other machine learning frameworks. Its
core concepts are DAGs, operators, and executors.
13.
With the tools and options presented in this article, managing your scheduled
jobs does not have to be tedious. You can quickly start building out job schedulers
that help you automate most, if not all, of your data science pipelines, and
build out ETL job schedulers that perform data extraction. These can help you pipe
extracted data into services such as Dask, offered by Saturn Cloud, which makes
scaling your data science, deep learning, and machine learning models a lot easier.