"At Netflix, we deal with millions of digital assets every day. Hours of video clips, along with audio, text and image assets are ingested for various purposes. Several workflows are then executed on them; such as inspection, transcoding, editing, logging, etc. These assets can also be used in machine learning workflows, either to train these models, or to get content insights. Not all workflows are applicable to all assets, and some workflows depend on other workflows to run. Additionally, new workflows are introduced regularly, and they need to be executed on existing assets, as well.
We implemented a workflow rule engine that allows users to define rules and conditions to specify the applicable workflows for assets, based on their types, metadata and states. In order to make this system scalable and fault tolerant, we utilize Kafka to send out events on asset state changes (on create, update, workflow completion, etc.) with minimal information in the payload (asset id and version). The rule engine then enriches this payload by fetching additional metadata, evaluates it against the workflow rules, triggers applicable workflows based on the outcome, and monitors their results by listening to the workflow events.
By using a highly available Kafka setup, we can easily scale, handle ETL cases such as migrations, and replay messages if needed, without impacting asset ingestion."
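As a rough illustration of the "minimal information in the payload" approach described above, the asset-change event can carry little more than the asset id and version, leaving enrichment to the consumer. This is a hedged sketch; the class, topic, and field names (AssetChangeEvent, asset-changes, changeType) are assumptions, not Netflix's actual identifiers.

```java
// Hypothetical sketch of a minimal asset-change event and its production to Kafka.
// Class, topic and field names are illustrative only.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class AssetChangeProducer {

    /** Minimal payload: just enough for consumers to fetch the rest on demand. */
    record AssetChangeEvent(String assetId, long version, String changeType) {
        String toJson() {
            return String.format(
                "{\"assetId\":\"%s\",\"version\":%d,\"changeType\":\"%s\"}",
                assetId, version, changeType);
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            AssetChangeEvent event = new AssetChangeEvent("asset-123", 4, "UPDATE");
            // Key by asset id so all changes to one asset land on the same partition (ordering).
            producer.send(new ProducerRecord<>("asset-changes", event.assetId(), event.toJson()));
        }
    }
}
```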
6. Task Characteristics
300+ tasks
Dependencies on other tasks
Multiple tasks on the same asset
Multiple conditions: only run if the asset state meets these conditions
Trigger: only run when a relevant change occurs on the asset
Distributed ownership
Dynamic: new tasks are added, and existing ones are updated regularly
Shared steps across tasks
8. Single point of failure
Hard to debug, error handling is tricky
Making changes is hard, slow and error-prone
Hard to manage when you have more than a handful of states
Does not allow for parallel execution of several states
Not a good choice when ownership of states is distributed
A Traditional State Machine?
10. Hard to scale if certain tasks are executed more frequently than others
Single point of failure
Making changes, adding steps / branches is hard and error-prone
Difficult to understand / monitor the flow
Error handling is tricky
Ownership ambiguity
A Large Monolith: Task Chain Management
11. What we did
Separate the rules from the tasks.
Keep tasks simple, stateless.
The rules specify the changes needed to trigger a task on an asset.
Listen to asset change events to evaluate rules
[Diagram: the Asset Management Service emits asset change events; rules are evaluated based on asset changes, and matching rules trigger tasks through a task listener; each task execution alters an asset's state, resulting in a new event. Rules and tasks are kept separate.]
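A minimal sketch of this loop, under the assumption that rules are simple predicates over asset metadata: the listener receives an asset-change event, evaluates every registered rule against the enriched asset, and triggers the tasks whose rules match. The Rule, TaskTrigger, and RuleEvaluator names are illustrative, not the actual service's types.

```java
// Hypothetical sketch of the rule-evaluation loop; Rule, TaskTrigger and RuleEvaluator are illustrative types.
import java.util.List;
import java.util.Map;

interface Rule {
    boolean matches(Map<String, Object> assetMetadata, String changeType);
    String taskName();
}

interface TaskTrigger {
    void trigger(String taskName, String assetId, long version);
}

class RuleEvaluator {
    private final List<Rule> rules;
    private final TaskTrigger trigger;

    RuleEvaluator(List<Rule> rules, TaskTrigger trigger) {
        this.rules = rules;
        this.trigger = trigger;
    }

    /** Called for every asset-change event; tasks stay simple and stateless, the rules decide everything. */
    void onAssetChange(String assetId, long version, String changeType,
                       Map<String, Object> assetMetadata) {
        for (Rule rule : rules) {
            if (rule.matches(assetMetadata, changeType)) {
                // Triggering a task eventually alters the asset's state,
                // which produces a new event and re-enters this loop.
                trigger.trigger(rule.taskName(), assetId, version);
            }
        }
    }
}
```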
12. [Diagram: example rules for an asset. Conditions built from MIME TYPE ANALYSIS and FILE INSPECTION results — is a PSD file, layer count > 1, is single layer, < 10K px width, NFLX Original Movie ID (only assets associated with Netflix Originals), tagged w/ MARKETING (marketing related assets only), … — gate tasks such as "Flatten PSD assets" and "Create PSD Thumbnail".]
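Expressed against the hypothetical Rule interface from the earlier sketch, the "Flatten PSD assets" rule on this slide might look roughly like the following; the metadata keys (mimeType, layerCount, widthPx, nflxOriginalMovieId, tags) are assumptions for illustration.

```java
// Illustrative encoding of the "Flatten PSD assets" rule; metadata keys are assumed, not actual field names.
import java.util.List;
import java.util.Map;

class FlattenPsdRule implements Rule {

    @Override
    public boolean matches(Map<String, Object> md, String changeType) {
        return "image/vnd.adobe.photoshop".equals(md.get("mimeType"))         // asset is a PSD file
            && ((Number) md.getOrDefault("layerCount", 0)).intValue() > 1      // layer count > 1
            && ((Number) md.getOrDefault("widthPx", 0)).intValue() < 10_000    // < 10K px width
            && md.get("nflxOriginalMovieId") != null                           // Netflix Originals only
            && ((List<?>) md.getOrDefault("tags", List.of())).contains("MARKETING"); // marketing assets only
    }

    @Override
    public String taskName() {
        return "FLATTEN_PSD";
    }
}
```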
13. Improved understanding of the tasks
Easy to modify and add tasks without impacting others
Abstraction from other tasks and how they run
Ability to enable/disable/scale tasks individually
Error handling / retries can be done at the task level
Distributed ownership of tasks
Improved monitoring and tracing of executions via events
Event Driven Rule Engine with Micro Workflows
14. High Level Architecture
[Diagram components: Studio Applications, Asset Client, Assets Management Service, Kafka, Tasks Management Service (Event Listener, Rule Evaluation, Scheduler), Rules Repository, Task Executor Workers, and Micro Tasks such as Inspection, Transcode, Video Thumbnail, Clip, Localize, Shot Boundary, Audio Waveform, Video Text, QC, …]
15. Conductor
● Workflow Orchestration Engine
● Workflow blueprint
● Pool of task workers hosted in user's services
● Advanced configuration
● Execution management
● Observability
● UI
➡ conductor.netflix.com
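As one hedged example of the "pool of task workers hosted in user's services" model, a Conductor task worker in Java typically implements the client's Worker interface along these lines. The task name, input key, and output field below are made up for illustration.

```java
// Sketch of a Conductor task worker; the task name and input/output fields are illustrative.
import com.netflix.conductor.client.worker.Worker;
import com.netflix.conductor.common.metadata.tasks.Task;
import com.netflix.conductor.common.metadata.tasks.TaskResult;

public class FlattenPsdWorker implements Worker {

    @Override
    public String getTaskDefName() {
        // Must match the task definition registered with the Conductor server.
        return "flatten_psd";
    }

    @Override
    public TaskResult execute(Task task) {
        String assetId = (String) task.getInputData().get("assetId");

        // ... do the actual flattening work here ...

        TaskResult result = new TaskResult(task);
        result.getOutputData().put("flattenedAssetId", assetId + "-flat");
        result.setStatus(TaskResult.Status.COMPLETED);
        return result;
    }
}
```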
16. Asset Changes
[Diagram: an asset evolves version by version as tasks run against it; tasks triggered along the timeline include MIME TYPE ANALYSIS, PSD FILE INSPECTION, SHARE W/ MKTG TEAM, and FLATTEN PSD.]

Version | Asset state (newly added fields marked with >)
1.0     | file://location
1.1     | file://location, > PSD
1.2     | file://location, PSD, > 1920px, > 15 layers
1.3     | file://location, PSD, 1920px, 15 layers, > Marketing Tag
1.4     | file://location, PSD, 1920px, 15 layers, Marketing Tag, > NFLX Show
28. What to include in the event payload?
Payload Diff: still possible to hit the 1 MB limit; increased producer responsibility; no additional RPC from the consumer; high storage cost.
ID + Versions: small payload size; diff'ing done on the consumer side; additional RPC to fetch the payload; low storage cost.
Full Payload: can only send up to 1 MB; diff'ing done on the consumer side; no additional RPC from the consumer; high storage cost.
(Producer: Asset Management Service)
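For the "ID + Versions" option, the consumer pays the extra RPC itself: it fetches the relevant asset versions and computes the diff locally. A rough sketch of that enrichment, where AssetServiceClient and its method are hypothetical rather than a real client API:

```java
// Hypothetical consumer-side enrichment for the "ID + Versions" payload option.
// AssetServiceClient and its method are illustrative, not a real client API.
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class AssetEnricher {

    interface AssetServiceClient {
        Map<String, Object> getAssetMetadata(String assetId, long version);
    }

    private final AssetServiceClient assetService;

    AssetEnricher(AssetServiceClient assetService) {
        this.assetService = assetService;
    }

    /** Fetch both versions (the additional RPCs) and return only the fields that changed. */
    Map<String, Object> diff(String assetId, long previousVersion, long newVersion) {
        Map<String, Object> before = assetService.getAssetMetadata(assetId, previousVersion);
        Map<String, Object> after = assetService.getAssetMetadata(assetId, newVersion);

        Map<String, Object> changed = new HashMap<>();
        for (Map.Entry<String, Object> entry : after.entrySet()) {
            if (!Objects.equals(entry.getValue(), before.get(entry.getKey()))) {
                changed.put(entry.getKey(), entry.getValue());
            }
        }
        return changed;
    }
}
```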
36. Consumer Tips
Process each record polled in parallel to increase throughput (assuming order is not important).
Try to design with idempotency in mind.
Ability to auto-scale: monitor Kafka latency or the number of events produced.
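One common shape for the "process each polled record in parallel" tip, assuming per-record ordering does not matter; the topic name, group id, and thread-pool size are placeholders:

```java
// Sketch of parallel processing of a polled batch when per-record ordering is not required.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "asset-change-consumers");
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        ExecutorService pool = Executors.newFixedThreadPool(16);
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("asset-changes"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                // Fan out the batch, then wait for every record before committing,
                // so offsets are only committed for fully processed batches.
                CompletableFuture<?>[] futures = new CompletableFuture<?>[records.count()];
                int i = 0;
                for (ConsumerRecord<String, String> record : records) {
                    futures[i++] = CompletableFuture.runAsync(() -> process(record), pool);
                }
                CompletableFuture.allOf(futures).join();
                consumer.commitSync();
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        // Idempotent handling goes here: processing the same event twice must be safe.
    }
}
```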
38. Backfill after a new task is introduced
[Diagram: for each asset, extract all versions (1.0, 2.0, 3.0, 4.0) and create events; the new task name is added to the payload; a dedicated backfill consumer processes these events alongside the regular consumers.]
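A rough sketch of that backfill producer: iterate existing assets and their versions, and publish events that carry the newly introduced task name on a dedicated topic so the production flow is untouched. The AssetStore interface and topic name are assumptions.

```java
// Hypothetical backfill producer: replays existing asset versions as events on a dedicated topic.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.List;

class BackfillProducer {

    interface AssetStore {
        List<String> allAssetIds();
        List<Long> versionsOf(String assetId);
    }

    private final KafkaProducer<String, String> producer;
    private final AssetStore assetStore;

    BackfillProducer(KafkaProducer<String, String> producer, AssetStore assetStore) {
        this.producer = producer;
        this.assetStore = assetStore;
    }

    /** Emit one event per asset version, tagged with the task the backfill targets. */
    void backfill(String newTaskName) {
        for (String assetId : assetStore.allAssetIds()) {
            for (long version : assetStore.versionsOf(assetId)) {
                String payload = String.format(
                    "{\"assetId\":\"%s\",\"version\":%d,\"taskName\":\"%s\"}",
                    assetId, version, newTaskName);
                // Dedicated topic keeps the backfill isolated from the production flow.
                producer.send(new ProducerRecord<>("asset-changes-backfill", assetId, payload));
            }
        }
    }
}
```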
39. Key Considerations for Backfills & Migrations
Separate Kafka topic: isolation from the production flow; adjusted retention and partition size.
Separate consumer group: avoids impact on the production flow; scaled independently; can run slower.
Dependent services: identify the limits of dependent services; self-throttle.
Monitoring & alerts: dashboards, Kafka lag, services health, error rates.
Request control: tuneable event production rate; ability to pause / suspend if necessary.
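For the request-control point, one simple approach is to wrap the backfill loop in a rate limiter whose rate and pause switch come from dynamic properties. The sketch below uses Guava's RateLimiter; the property lookups are placeholders for whatever configuration system is in use.

```java
// Sketch of a tuneable, pausable backfill loop using Guava's RateLimiter.
// The dynamic-property lookups are placeholders, not a specific config API.
import com.google.common.util.concurrent.RateLimiter;

class ThrottledBackfill {

    private final RateLimiter rateLimiter = RateLimiter.create(50.0); // events/second, adjustable at runtime

    void run(Iterable<String> eventPayloads) {
        for (String payload : eventPayloads) {
            while (isPaused()) {                   // dynamic "pause" switch for major incidents
                sleepQuietly(1_000);
            }
            rateLimiter.setRate(configuredRate()); // pick up rate changes without a redeploy
            rateLimiter.acquire();                 // self-throttle to protect dependent services
            publish(payload);
        }
    }

    private boolean isPaused() { return false; }      // e.g. backed by a dynamic property
    private double configuredRate() { return 50.0; }  // e.g. backed by a dynamic property
    private void publish(String payload) { /* send to Kafka */ }

    private static void sleepQuietly(long millis) {
        try { Thread.sleep(millis); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```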
43. Observability - Real Time System Insight
Monitoring
Alerting
Discovery
Troubleshooting
Performance Analysis
System Impact
● What?
● When?
● Why?
● How?
45. Events Traceability
[Diagram: the Assets Management Service, Workflow Management Service, Workers, and Events Processor all write to a Centralized Logging Service.]

Id  | Timestamp            | Event Id | Asset Id | Task execution Id
100 | Sep 17, 16:47:40.844 | 95f9..   | bade..   | aedd-4
101 | Sep 17, 16:47:40.851 | 95f9..   | bade..   | aedd-4
102 | Sep 17, 16:47:40.900 | 95f9..   | bade..   | cbaa-2
104 | Sep 17, 16:47:40.950 | 95f9..   | bade..   | cbaa-2
48. Failures, Reprocessing and Dead Letter Queues
Ack only after successful processing of events.
If processing fails, send the failed event to a separate Kafka topic for reprocessing; multiple topics can be used based on priority and/or retry count.
If reprocessing fails, save the event in a Dead Letter Queue: a Kafka topic, or a simple KV or DB table.
Consider adding the ability to pause consumers in case of a major outage, e.g. via dynamic properties with Archaius (github.com/Netflix/archaius).
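A hedged sketch of the retry/DLQ routing described above; the topic names and the retry-count header are placeholders, not the actual implementation:

```java
// Sketch of routing failed events to a retry topic and, after too many attempts, to a DLQ topic.
// Topic names and the retry-count header are illustrative.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.nio.charset.StandardCharsets;

class FailureRouter {

    private static final int MAX_RETRIES = 3;
    private final KafkaProducer<String, String> producer;

    FailureRouter(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    void handle(ConsumerRecord<String, String> record, Runnable processing) {
        try {
            processing.run();
            // Success: the offset is committed ("acked") by the normal consumer loop.
        } catch (Exception e) {
            int retries = retryCount(record);
            String target = retries < MAX_RETRIES ? "asset-changes-retry" : "asset-changes-dlq";
            ProducerRecord<String, String> out =
                new ProducerRecord<>(target, record.key(), record.value());
            out.headers().add("retryCount",
                String.valueOf(retries + 1).getBytes(StandardCharsets.UTF_8));
            producer.send(out);
        }
    }

    private int retryCount(ConsumerRecord<String, String> record) {
        var header = record.headers().lastHeader("retryCount");
        return header == null ? 0 : Integer.parseInt(new String(header.value(), StandardCharsets.UTF_8));
    }
}
```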
49. Failures, Reprocessing and Dead Letter Queues
[Diagram: the Asset Change Consumer, Failure Consumer, and DLQ Consumer read from Kafka; failed events are written to an Events Store (DLQ), which can be queried for failed events; the Asset Mgmt Svc holds asset versions 1, 2, and 3. Failed events can be out of order.]
50. Unit Testing: test the producer and consumer in isolation.
Integration Testing: simulate scenarios that validate the producer, event generation, event payload, and consumers.
Replay Testing: reprocessing of events at runtime.
Load Testing: simulate high-scale events to verify latency and scalability, and understand system limitations.
Idempotency Testing: verify that processing the same event multiple times has the same effect as processing it once.
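As one hedged example of the idempotency test in the last item, a unit test can process the same event twice and assert that the resulting state matches processing it once. The EventHandler and InMemoryStateStore below are hypothetical test doubles, not classes from the deck.

```java
// Illustrative idempotency test: handling the same event twice must leave the same state as handling it once.
import org.junit.jupiter.api.Test;
import java.util.HashMap;
import java.util.Map;
import static org.junit.jupiter.api.Assertions.assertEquals;

class IdempotencyTest {

    /** Minimal stand-in for whatever state a consumer mutates. */
    static class InMemoryStateStore {
        private final Map<String, Long> versions = new HashMap<>();
        void put(String assetId, long version) { versions.put(assetId, version); }
        Map<String, Long> snapshot() { return Map.copyOf(versions); }
    }

    /** Idempotent handler: re-applying the same (assetId, version) pair is a no-op. */
    static class EventHandler {
        private final InMemoryStateStore store;
        EventHandler(InMemoryStateStore store) { this.store = store; }
        void handle(String assetId, long version) { store.put(assetId, version); }
    }

    @Test
    void processingSameEventTwiceHasSameEffectAsOnce() {
        InMemoryStateStore once = new InMemoryStateStore();
        InMemoryStateStore twice = new InMemoryStateStore();

        new EventHandler(once).handle("asset-123", 2);

        EventHandler handler = new EventHandler(twice);
        handler.handle("asset-123", 2);
        handler.handle("asset-123", 2); // duplicate delivery

        assertEquals(once.snapshot(), twice.snapshot());
    }
}
```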