3. What is it?
1. Schedules jobs periodically.
2. Granularity is in minutes.
3. Some jobs may take 30 minutes to execute.
Need to retry asynchronous API requests after a few seconds? Send a confirmation message after 10 seconds? Change e-commerce prices on demand? For those, an event-based scheduler is the answer: events take less than a second to execute.
10. Functional Requirements
1. Job Dependency
a. Execute a job only if the most recent execution of each of its dependent jobs in the last 24 hrs was successful.
2. Job Retry
3. Job Timeout
a. If execution takes more than 1 hr, time out the job.
b. Developers must break up their jobs if they take more than 1 hr.
4. Job Criticality
a. The application can give a list of machines where the job can be executed.
And so we are designing our own scheduler!!
19. Non Functional Requirements
1. 5000 applications need the scheduler.
2. On average, an application needs to execute 5 jobs every minute.
That is 25000 jobs every minute!!
21. Ideal Job?
1. Idempotent
a. f(f(x)) = f(x)
b. Re-running a job shouldn't redo things. A payment made once shouldn't be made again.
c. A job may execute well, but the cron may not get the status.
d. A job may fail in between.
2. Transactional
a. Commit changes together.
b. A job may fail in between, so partial changes must not persist.
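A minimal sketch of what idempotency buys us, with hypothetical names not taken from the deck: if the scheduler retries a payment job because it never received the status, the second run must return the stored result instead of charging again.

```python
# Hypothetical idempotent payment job: re-running with the same idempotency
# key must not charge twice, even if the first run succeeded but the
# scheduler lost the status and retried.

processed = {}  # idempotency_key -> result; durable storage in production

def charge(idempotency_key, amount):
    """Charge at most once per key: f(f(x)) = f(x)."""
    if idempotency_key in processed:      # already done: return stored result
        return processed[idempotency_key]
    result = {"charged": amount}          # the real side effect happens once
    processed[idempotency_key] = result
    return result

first = charge("order-42", 100)
retry = charge("order-42", 100)           # retry after losing the status
assert first == retry                     # no double charge
```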
23. Implementation
1. The application will provide machines for job execution.
a. Execution time will not be a bottleneck in scaling the cron scheduler.
2. To schedule a cron job, the application will give the following parameters:
a. Unique identifier (uuid)
b. Schedule time (*/10 * * * *)
c. Machines where the job can be executed ([{ip1, user1}, {ip2, user2}])
d. Command to execute the job (node <path> <arguments>)
e. Retry (time interval)
f. Dependent jobs ([uuid1, uuid2])
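The parameters above can be sketched as a registration payload. The field names and example values are assumptions for illustration; the deck only fixes the six parameters themselves.

```python
# Hypothetical job-registration payload built from parameters a-f above.
import uuid

job = {
    "uuid": str(uuid.uuid4()),                         # a. unique identifier
    "schedule": "*/10 * * * *",                        # b. cron schedule time
    "machines": [{"ip": "10.0.0.1", "user": "user1"},  # c. allowed machines
                 {"ip": "10.0.0.2", "user": "user2"}],
    "command": "node /app/job.js --mode=batch",        # d. command to execute
    "retry_interval_sec": 300,                         # e. retry after 5 min
    "dependent_jobs": [],                              # f. dependency uuids
}

required = {"uuid", "schedule", "machines", "command",
            "retry_interval_sec", "dependent_jobs"}
assert required <= job.keys()                          # all six parameters set
```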
26. Implementation - Kafka
1. Topics that will get jobs to be executed
a. The cron scheduler will produce into them.
b. Each machine will have a Kafka consumer to consume the jobs it needs to execute.
c. 1-1 partition-machine mapping, so each partition gets jobs for the same machine in order.
d. 5000 applications will have at most 50000 machines, so 5 Kafka brokers are needed.
e. Expiry of 1 hr, as we have a timeout of 1 hr.
2. Topics that will get metadata of each job execution
a. Each machine will produce into them.
b. One partition is needed, as order is not important while consuming.
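The 1-1 partition-machine mapping in 1.c can be sketched as a stable assignment table: each machine owns exactly one partition, so its jobs are consumed in the order they were produced. The names here are assumptions; in the design this mapping lives in the machines table.

```python
# Sketch of a stable machine -> partition assignment (assumed names).
partition_of = {}   # machine id -> partition number

def register_machine(machine_id):
    """Assign the next free partition to a newly registered machine."""
    if machine_id not in partition_of:
        partition_of[machine_id] = len(partition_of)
    return partition_of[machine_id]

# Producing a job always targets the machine's own partition.
p1 = register_machine("10.0.0.1:user1")
p2 = register_machine("10.0.0.2:user2")
assert p1 != p2                                   # distinct machines, distinct partitions
assert register_machine("10.0.0.1:user1") == p1   # mapping is stable
```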
28. Implementation - RDBMS
1. Table cron_jobs
a. Stores cron jobs with their metadata.
2. Table cron_jobs_scheduled
a. Contains cron jobs to be executed in the next 10 minutes, with trigger timestamp in minutes.
3. Table cron_jobs_executed
a. Each attempt of cron job execution, with trigger timestamp and status (~1 hr of data).
4. Table machines
a. Contains machine metadata with healthStatus, numOfJobsBeingExecuted, and the corresponding Kafka topic/partition.
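A minimal sketch of the four tables, using sqlite3 from the standard library so it is runnable. Column names beyond those the deck mentions (status values, json encoding of dependency lists) are assumptions.

```python
# Illustrative schema for the four tables; column details are assumed.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cron_jobs (
    uuid TEXT PRIMARY KEY,
    schedule TEXT,              -- e.g. */10 * * * *
    command TEXT,
    retry_interval_sec INTEGER,
    dependent_jobs TEXT         -- json list of uuids
);
CREATE TABLE cron_jobs_scheduled (
    uuid TEXT,
    trigger_minute TEXT         -- trigger timestamp in minutes
);
CREATE TABLE cron_jobs_executed (
    uuid TEXT,
    trigger_minute TEXT,
    status TEXT                 -- e.g. PENDING / SUCCESS / FAILED / TIMEOUT
);
CREATE TABLE machines (
    ip TEXT, user TEXT,
    health_status TEXT,
    num_of_jobs_being_executed INTEGER,
    kafka_partition INTEGER
);
""")
tables = {r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")}
assert tables == {"cron_jobs", "cron_jobs_scheduled",
                  "cron_jobs_executed", "machines"}
```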
29. Implementation - Cache
1. Will help us implement the Job Dependency requirement.
2. Whenever a job fails, store its uuid with a 24 hr TTL.
3. When a job succeeds, remove it from the cache.
4. In the worst case it will hold all unique cron jobs:
a. 5000 applications * 100 jobs per application = 500000 uuids * 5 bytes (2.5 MB)
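The failure cache and the dependency check can be sketched together; time handling is simplified for illustration (an in-memory dict instead of a real TTL cache), but the rule matches the slides: a job runs only if none of its dependent jobs failed within the last 24 hrs.

```python
# Sketch of the 24 hr failure cache backing the Job Dependency check.
import time

TTL = 24 * 3600
failed = {}   # uuid -> time the failure was recorded

def record_failure(uuid, now=None):
    failed[uuid] = now if now is not None else time.time()

def record_success(uuid):
    failed.pop(uuid, None)           # success removes the uuid from the cache

def can_run(dependent_jobs, now=None):
    now = now if now is not None else time.time()
    for dep in dependent_jobs:
        if dep in failed and now - failed[dep] <= TTL:
            return False             # a dependency failed within the last 24 hrs
    return True

record_failure("dep-1", now=1000.0)
assert not can_run(["dep-1"], now=2000.0)          # recent failure: blocked
assert can_run(["dep-1"], now=1000.0 + TTL + 1)    # failure aged out
record_success("dep-1")
assert can_run(["dep-1"], now=2000.0)              # success cleared the cache
```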
30. Implementation - Scheduler Jobs
1. A job that runs every 10th minute:
a. Generates an sql file containing 1 row per cron job to be executed in the next 10 minutes.
b. At 13:10 it will generate an sql file containing jobs to be executed at [13:11, 13:12, …, 13:20].
c. After generating, it dumps the file into cron_jobs_scheduled (2.5 lakh (250,000) entries take < 1 sec).
2. A job that runs every 5th minute:
a. Gets the health of every machine.
b. Two bulk updates in the machines table, one for healthy machines and one for unhealthy.
3. A job that runs every minute:
a. Updates cron jobs in PENDING state with trigger time = curMinute - 60 as TIMEOUT.
b. For every such job that needs to be retried, inserts a row in the corresponding sql file.
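The expansion step in 1.a-1.b can be sketched as follows: for each job, expand its cron minute field over the next 10-minute window and emit one trigger per match. As an assumption, only the simple "*", "*/n", and fixed-minute patterns are handled here.

```python
# Sketch of expanding a cron minute field over the next 10 minutes
# (only "*", "*/n", and a fixed minute are handled, as an assumption).

def trigger_minutes(minute_field, start_minute):
    """Minutes in (start_minute, start_minute + 10] at which the job fires."""
    window = [(start_minute + i) % 60 for i in range(1, 11)]
    if minute_field == "*":
        return window
    if minute_field.startswith("*/"):
        step = int(minute_field[2:])
        return [m for m in window if m % step == 0]
    return [m for m in window if m == int(minute_field)]

# At 13:10, a */10 job lands once in [13:11 .. 13:20], at minute 20.
assert trigger_minutes("*/10", 10) == [20]
# An every-minute job fires at all ten minutes in the window.
assert len(trigger_minutes("*", 10)) == 10
```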
33. Implementation - Scheduler Processes
1. A process that, at the 55th second of every minute:
a. Fetches, then deletes, from cron_jobs_scheduled, the jobs to be executed at the 60th second [< 1s]
b. Filters out jobs any of whose dependent jobs is in the cache [fetch from cache + build map + filter] [< 3s]
c. Fetches machines, and for each job picks the least busy healthy machine (else an unhealthy one) [< 1s]
d. Bulk updates the machines table to increment numOfJobsBeingExecuted [< 100 ms]
e. Bulk inserts into cron_jobs_executed [< 1s]
f. Pushes the jobs into the Kafka partition corresponding to the machine they were assigned [< 100 ms]
2. A process that consumes from the Kafka topic that gets execution metadata:
a. Asynchronous updates on the cache & bulk updates in cron_jobs_executed for successful and failed executions.
b. For every failed execution, inserts into the sql file using the retry interval.
c. Metadata for each execution, like time taken and logs, can be consumed into Elasticsearch.
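The machine-selection rule in step 1.c can be sketched directly: prefer the least busy healthy machine from the job's allowed list, falling back to an unhealthy one only if no healthy machine exists. The record fields mirror the machines table; values are illustrative.

```python
# Sketch of step 1.c: pick the least busy healthy machine, else any machine.

def pick_machine(machines):
    healthy = [m for m in machines if m["health_status"] == "HEALTHY"]
    pool = healthy if healthy else machines   # fall back to unhealthy machines
    return min(pool, key=lambda m: m["num_of_jobs_being_executed"])

machines = [
    {"ip": "10.0.0.1", "health_status": "HEALTHY",   "num_of_jobs_being_executed": 7},
    {"ip": "10.0.0.2", "health_status": "HEALTHY",   "num_of_jobs_being_executed": 2},
    {"ip": "10.0.0.3", "health_status": "UNHEALTHY", "num_of_jobs_being_executed": 0},
]
assert pick_machine(machines)["ip"] == "10.0.0.2"   # least busy healthy machine
```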
35. Scaling Further
1. We used just one instance each of the DB, processes, jobs, and cache.
2. Horizontal scaling can be used.
3. NoSQL can be evaluated, as writes then don't lock the table.