What is Temporal.io?
https://www.youtube.com/watch?v=0BdLUnay9ok&ab_channel=fwdays
#Short description based on documentation
What is Day.io?
- Collect clock-in/out events (punches) from mobile and tablet devices, integrations and web widgets.
- For each employee and each shift, calculate the basic metrics used for salary payments: worked time, rest time, night-shift time, overtime and many more.
- Update aggregated metrics for the payroll period report.
- Generate files with custom formatting from stored JSON data.
Requirements
- Bursty load of events that trigger recalculations
- Deduplication & throttling
Previously we had huge queues, and all events were processed one by one even though we knew there were duplicates.
- Persist results at different stages
- Optimisations inside the calculation pipeline
- Scalability
- Retry logic
- Transactions
Where is the load?
- ~2M events per day
- Synchronous calculations:
  - >30 metrics for a given period (from - to)
  - >15 metrics that depend on the previous day (from - now): changing a day 1 year ago triggers recalculation of the entire year!
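The difference between the two metric families can be sketched in a few lines. This is an illustration only: `recalculation_range` and `days_in` are hypothetical helpers, not Day.io code, and the assumption is that a chained metric must be recomputed for every day from the edited one up to today.

```python
from datetime import date


def recalculation_range(period_from: date, period_to: date, today: date,
                        depends_on_previous_day: bool) -> tuple[date, date]:
    """Return the inclusive (start, end) range of days that must be recalculated."""
    if depends_on_previous_day:
        # Chained metric (from - now): every day after the edit is stale, up to today.
        return period_from, max(period_to, today)
    # Period metric (from - to): only the edited period itself is stale.
    return period_from, period_to


def days_in(day_range: tuple[date, date]) -> int:
    """Number of days in an inclusive range."""
    start, end = day_range
    return (end - start).days + 1


edited_day = date(2022, 3, 1)
today = date(2023, 3, 1)
print(days_in(recalculation_range(edited_day, edited_day, today, False)))
print(days_in(recalculation_range(edited_day, edited_day, today, True)))
```

Under these assumptions, a single old punch costs one day of work for a period metric and a whole year of work for a chained one, which is where the load comes from.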
High-level setup
- 5 Kafka consumers
- In-house Temporal cluster on top of Postgres
- 2 workers for High Priority workflows
- 15 workers for Low Priority workflows
Workflow, version 1
- Single workflow run per employee and period (controlled by Temporal)
- Policy: terminate the running workflow if a new one with the same ID arrives
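A toy model of this policy, for illustration only. It does not use the Temporal SDK: `Registry` and `Run` are hypothetical stand-ins for the bookkeeping the cluster does, and the workflow ID format is an assumption. In Temporal itself the behaviour is configured through a workflow ID reuse/conflict policy when starting the workflow; the exact option name depends on the SDK version.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    workflow_id: str
    run_number: int
    status: str = "running"  # "running" or "terminated"


@dataclass
class Registry:
    """Toy stand-in for server-side bookkeeping; not the Temporal API."""
    runs: dict[str, Run] = field(default_factory=dict)
    started: int = 0

    def start(self, employee_id: str, period_from: str, period_to: str) -> Run:
        # Assumed ID scheme: one ID per employee and period.
        workflow_id = f"{employee_id}:{period_from}:{period_to}"
        previous = self.runs.get(workflow_id)
        if previous is not None and previous.status == "running":
            # Same ID arrived again: interrupt the calculation in progress.
            previous.status = "terminated"
        self.started += 1
        run = Run(workflow_id, self.started)
        self.runs[workflow_id] = run
        return run


registry = Registry()
first = registry.start("emp-1", "2024-01-01", "2024-01-31")
second = registry.start("emp-1", "2024-01-01", "2024-01-31")  # terminates `first`
other = registry.start("emp-1", "2024-02-01", "2024-02-29")   # different period, unaffected
```

Note that every event still starts a new workflow run in this model; only the wasted calculation is cut short.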
Outputs:
- Faster than the old setup, because we started interrupting calculations when the same event arrived again
- Temporal cluster was at almost 100% CPU even under small load
- Workers at ~100% CPU
- Scaling didn’t help
Next steps:
- Reduce the number of workflows
- Fine-tune configuration
Workflow, version 2
- Single workflow run per employee (controlled by Temporal)
- Notify the running workflow about each consumed event for the same employee
- In-memory queue in each workflow, with deduplication logic for intersecting periods
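The deduplication idea can be sketched as an interval-merging queue. This is a minimal illustration, not the production code: `RecalculationQueue` is a hypothetical name, periods are assumed to be inclusive date ranges, and the wiring to Temporal signals is left out.

```python
from datetime import date
from typing import Optional


class RecalculationQueue:
    """In-memory queue of periods; intersecting periods collapse into one."""

    def __init__(self) -> None:
        self._periods: list[tuple[date, date]] = []

    def push(self, start: date, end: date) -> None:
        if start > end:
            raise ValueError("period start must not be after its end")
        kept = []
        for queued_start, queued_end in self._periods:
            if queued_start <= end and start <= queued_end:
                # Periods intersect: grow the new period to cover both.
                start, end = min(start, queued_start), max(end, queued_end)
            else:
                kept.append((queued_start, queued_end))
        kept.append((start, end))
        kept.sort()
        self._periods = kept

    def pop(self) -> Optional[tuple[date, date]]:
        """Take the earliest pending period, or None when the queue is empty."""
        return self._periods.pop(0) if self._periods else None

    def __len__(self) -> int:
        return len(self._periods)


queue = RecalculationQueue()
queue.push(date(2024, 1, 1), date(2024, 1, 10))
queue.push(date(2024, 1, 5), date(2024, 1, 15))   # intersects, merged
queue.push(date(2024, 1, 5), date(2024, 1, 15))   # exact duplicate, absorbed
queue.push(date(2024, 2, 1), date(2024, 2, 3))    # disjoint, kept separately
```

With this shape, a burst of events for one employee turns into a handful of recalculations instead of one per event.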
Outputs:
- Faster for clients again, this time also because of deduplication
- Temporal cluster was at almost 100% CPU at peak time
- Scaling started working, but only x2-2.5 at most, not as promised
Next step: We cannot reduce the number of workflows any further, so let’s reduce the number of activities
Workflow, version 3
- Merged all calculation, persisting and event-sending activities into a single activity
Outputs:
- Faster for clients again, because each workflow processes fewer system micro-tasks
- Temporal cluster finally not hitting 100% CPU
- Scaling improved from x2-2.5 to x5-7, but still not what we were promised
Next step: We can merge the last 2 activities, but that’s not a game changer. Let’s decrease latency inside the cluster by replacing the database
Replacing Postgres with Cassandra
Benchmark results for 12K workflows:

                                  Postgres-based    Cassandra-based
Total time                        ~45 minutes       ~15 minutes
Workflow starts, per sec          ~8                ~40
History pods latency, P99         ~6 ms, bursty     ~0.5 ms
GetTasks request rate, per sec    ~4                ~75
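For a quick read of these numbers, the improvement factors can be derived directly from the figures above. The dictionary keys are hypothetical labels; the values are the approximate ones from the benchmark, so the ratios are approximate too.

```python
# (Postgres-based, Cassandra-based) values taken from the benchmark above.
lower_is_better = {
    "total_time_minutes": (45.0, 15.0),
    "history_latency_p99_ms": (6.0, 0.5),
}
higher_is_better = {
    "workflow_starts_per_sec": (8.0, 40.0),
    "get_tasks_requests_per_sec": (4.0, 75.0),
}

improvement = {}
for name, (postgres, cassandra) in lower_is_better.items():
    # Time and latency: improvement is how many times smaller the value became.
    improvement[name] = postgres / cassandra
for name, (postgres, cassandra) in higher_is_better.items():
    # Rates: improvement is how many times larger the value became.
    improvement[name] = cassandra / postgres

for name, factor in improvement.items():
    print(f"{name}: x{factor:g}")
```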
Other workflow types
1. Long-running workflows that send alerts with a delay
2. Background workflow for processing bulk operations
3. Cron jobs
4. Orchestrating data migrations
Next steps
- Merge our consecutive workflows to use the power of signals and avoid duplicated runs of business logic
- Use Local Activities
- Move more cron jobs and business transactions into Temporal
- Implement Human-in-the-Loop business processes such as onboarding
Pros:
- Retry logic and transactions out of the box. Your sagas will look like pure functions.
- Durable execution of long-running processes without additional work
- Safe cron jobs, especially long-running or iterative ones
- Triggers and workers can be implemented in different languages
- Easy to set up an in-house cluster using Kubernetes
- Great UI visualisation tools
- Great Slack community, documentation and open-source code
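To make the saga point concrete, here is a plain-Python sketch of what the retry and compensation logic looks like when written by hand. It is not Temporal code: `with_retries` and `run_saga` are hypothetical helpers, and unlike Temporal nothing here survives a process crash.

```python
from typing import Callable

# A saga step is an (action, compensation) pair of no-argument callables.
Step = tuple[Callable[[], None], Callable[[], None]]


def with_retries(action: Callable[[], None], attempts: int = 3) -> None:
    """Run the action, retrying on any exception up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            action()
            return
        except Exception:
            if attempt == attempts:
                raise  # retries exhausted, let the saga handle it


def run_saga(steps: list[Step], attempts: int = 3) -> None:
    """Run steps in order; on failure undo the completed ones in reverse order."""
    compensations: list[Callable[[], None]] = []
    try:
        for action, compensate in steps:
            with_retries(action, attempts)
            compensations.append(compensate)
    except Exception:
        for compensate in reversed(compensations):
            compensate()
        raise
```

The workflow author only writes the body of `run_saga`; retries, timeouts and durability of the state between steps are what the platform is expected to provide.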
Lessons learned
Temporal is a nice tool if you know what you need, like any other tool.
Cons:
- Not very performant for short-running workflows with a huge number of activities
- Postgres is enough for flat load; use Cassandra for spikes
- Bugs with Cassandra
- CLI batch operations on workflows are very slow
- Expensive to keep a cluster running if workflows are not running all the time
- High entry barrier
Sources
Temporal.io Slack - https://t.mp/slack
Official documentation - https://docs.temporal.io
Blog - https://temporal.io/blog
Samples - https://github.com/temporalio/samples-typescript (also for Go, Python, Java, .NET)

"Scaling in space and time with Temporal", Andriy Lupa .pdf
