"At Netflix, we deal with millions of digital assets every day. Hours of video clips, along with audio, text and image assets are ingested for various purposes. Several workflows are then executed on them; such as inspection, transcoding, editing, logging, etc. These assets can also be used in machine learning workflows, either to train these models, or to get content insights. Not all workflows are applicable to all assets, and some workflows depend on other workflows to run. Additionally, new workflows are introduced regularly, and they need to be executed on existing assets, as well.
We implemented a workflow rule engine that allows users to define rules and conditions to specify the applicable workflows for assets, based on their types, metadata and states. In order to make this system scalable and fault tolerant, we utilize Kafka to send out events on asset state changes (on create, update, workflow completion, etc.) with minimal information in the payload (asset id and version). The rule engine then enriches this payload by fetching additional metadata, evaluates it against the workflow rules, triggers applicable workflows based on the outcome, and monitors their results by listening to the workflow events.
By using a highly available Kafka setup, we can easily scale, handle ETL cases such as migrations, and replay messages if needed, without impacting asset ingestion."
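As a rough illustration of the "minimal information in the payload" approach described above, the asset-change event can carry little more than the asset id and version, leaving enrichment to the consumer. This is a hedged sketch; the class, topic, and field names (AssetChangeEvent, asset-changes, changeType) are assumptions, not Netflix's actual identifiers.

```java
// Hypothetical sketch of a minimal asset-change event and its production to Kafka.
// Class, topic and field names are illustrative only.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class AssetChangeProducer {

    /** Minimal payload: just enough for consumers to fetch the rest on demand. */
    record AssetChangeEvent(String assetId, long version, String changeType) {
        String toJson() {
            return String.format(
                "{\"assetId\":\"%s\",\"version\":%d,\"changeType\":\"%s\"}",
                assetId, version, changeType);
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            AssetChangeEvent event = new AssetChangeEvent("asset-123", 4, "UPDATE");
            // Key by asset id so all changes to one asset land on the same partition (ordering).
            producer.send(new ProducerRecord<>("asset-changes", event.assetId(), event.toJson()));
        }
    }
}
```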
6. Task Characteristics
300+ tasks
Dependencies on other tasks
Multiple tasks on the same asset
Multiple conditions: only run if the asset state meets these conditions
Trigger: only run when a relevant change occurs on the asset
Distributed ownership
Dynamic: new tasks are added, and existing ones are updated regularly
Shared steps across tasks
8. Single point of failure
Hard to debug, error handling is tricky
Making changes is hard, slow and error-prone
Hard to manage when you have more than a handful of states
Does not allow for parallel execution of several states
Not a good choice when ownership of states is distributed
A Traditional State Machine?
10. Hard to scale if certain tasks are executed more frequently than others
Single point of failure
Making changes, adding steps / branches is hard and error-prone
Difficult to understand / monitor the flow
Error handling is tricky
Ownership ambiguity
A Large Monolith: Task Chain Management
11. What we did
Separate the rules from the tasks.
Keep tasks simple, stateless.
The rules specify the changes needed to trigger a task on an asset.
Listen to asset change events to evaluate rules
[Diagram: the Asset Management Service emits asset change events; rules are evaluated based on asset changes, and matching rules trigger tasks through a task listener; each task execution alters an asset's state, resulting in a new event. Rules and tasks are kept separate.]
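A minimal sketch of this loop, under the assumption that rules are simple predicates over asset metadata: the listener receives an asset-change event, evaluates every registered rule against the enriched asset, and triggers the tasks whose rules match. The Rule, TaskTrigger, and RuleEvaluator names are illustrative, not the actual service's types.

```java
// Hypothetical sketch of the rule-evaluation loop; Rule, TaskTrigger and RuleEvaluator are illustrative types.
import java.util.List;
import java.util.Map;

interface Rule {
    boolean matches(Map<String, Object> assetMetadata, String changeType);
    String taskName();
}

interface TaskTrigger {
    void trigger(String taskName, String assetId, long version);
}

class RuleEvaluator {
    private final List<Rule> rules;
    private final TaskTrigger trigger;

    RuleEvaluator(List<Rule> rules, TaskTrigger trigger) {
        this.rules = rules;
        this.trigger = trigger;
    }

    /** Called for every asset-change event; tasks stay simple and stateless, the rules decide everything. */
    void onAssetChange(String assetId, long version, String changeType,
                       Map<String, Object> assetMetadata) {
        for (Rule rule : rules) {
            if (rule.matches(assetMetadata, changeType)) {
                // Triggering a task eventually alters the asset's state,
                // which produces a new event and re-enters this loop.
                trigger.trigger(rule.taskName(), assetId, version);
            }
        }
    }
}
```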
12. [Diagram: example rules for an asset. Conditions built from MIME TYPE ANALYSIS and FILE INSPECTION results — is a PSD file, layer count > 1, is single layer, < 10K px width, NFLX Original Movie ID (only assets associated with Netflix Originals), tagged w/ MARKETING (marketing related assets only), … — gate tasks such as "Flatten PSD assets" and "Create PSD Thumbnail".]
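Expressed against the hypothetical Rule interface from the earlier sketch, the "Flatten PSD assets" rule on this slide might look roughly like the following; the metadata keys (mimeType, layerCount, widthPx, nflxOriginalMovieId, tags) are assumptions for illustration.

```java
// Illustrative encoding of the "Flatten PSD assets" rule; metadata keys are assumed, not actual field names.
import java.util.List;
import java.util.Map;

class FlattenPsdRule implements Rule {

    @Override
    public boolean matches(Map<String, Object> md, String changeType) {
        return "image/vnd.adobe.photoshop".equals(md.get("mimeType"))         // asset is a PSD file
            && ((Number) md.getOrDefault("layerCount", 0)).intValue() > 1      // layer count > 1
            && ((Number) md.getOrDefault("widthPx", 0)).intValue() < 10_000    // < 10K px width
            && md.get("nflxOriginalMovieId") != null                           // Netflix Originals only
            && ((List<?>) md.getOrDefault("tags", List.of())).contains("MARKETING"); // marketing assets only
    }

    @Override
    public String taskName() {
        return "FLATTEN_PSD";
    }
}
```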
13. Improved understanding of the tasks
Easy to modify and add tasks without impacting others
Abstraction from other tasks and how they run
Ability to enable/disable/scale tasks individually
Error handling / retries can be done at the task level
Distributed ownership of tasks
Improved monitoring and tracing of executions via events
Event Driven Rule Engine with Micro Workflows
14. High Level Architecture
[Diagram components: Studio Applications, Asset Client, Assets Management Service, Kafka, Tasks Management Service (Event Listener, Rule Evaluation, Scheduler), Rules Repository, Task Executor Workers, and Micro Tasks such as Inspection, Transcode, Video Thumbnail, Clip, Localize, Shot Boundary, Audio Waveform, Video Text, QC, …]
15. Conductor
● Workflow Orchestration Engine
● Workflow blueprint
● Pool of task workers hosted in user's services
● Advanced configuration
● Execution management
● Observability
● UI
➡ conductor.netflix.com
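As one hedged example of the "pool of task workers hosted in user's services" model, a Conductor task worker in Java typically implements the client's Worker interface along these lines. The task name, input key, and output field below are made up for illustration.

```java
// Sketch of a Conductor task worker; the task name and input/output fields are illustrative.
import com.netflix.conductor.client.worker.Worker;
import com.netflix.conductor.common.metadata.tasks.Task;
import com.netflix.conductor.common.metadata.tasks.TaskResult;

public class FlattenPsdWorker implements Worker {

    @Override
    public String getTaskDefName() {
        // Must match the task definition registered with the Conductor server.
        return "flatten_psd";
    }

    @Override
    public TaskResult execute(Task task) {
        String assetId = (String) task.getInputData().get("assetId");

        // ... do the actual flattening work here ...

        TaskResult result = new TaskResult(task);
        result.getOutputData().put("flattenedAssetId", assetId + "-flat");
        result.setStatus(TaskResult.Status.COMPLETED);
        return result;
    }
}
```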
16. Asset Changes
[Diagram: an asset evolves version by version as tasks run against it; tasks triggered along the timeline include MIME TYPE ANALYSIS, PSD FILE INSPECTION, SHARE W/ MKTG TEAM, and FLATTEN PSD.]

Version | Asset state (newly added fields marked with >)
1.0     | file://location
1.1     | file://location, > PSD
1.2     | file://location, PSD, > 1920px, > 15 layers
1.3     | file://location, PSD, 1920px, 15 layers, > Marketing Tag
1.4     | file://location, PSD, 1920px, 15 layers, Marketing Tag, > NFLX Show
28. What to include in the event payload?
Payload Diff: still possible to hit the 1 MB limit; increased producer responsibility; no additional RPC from the consumer; high storage cost.
ID + Versions: small payload size; diff'ing done on the consumer side; additional RPC to fetch the payload; low storage cost.
Full Payload: can only send up to 1 MB; diff'ing done on the consumer side; no additional RPC from the consumer; high storage cost.
(Producer: Asset Management Service)
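For the "ID + Versions" option, the consumer pays the extra RPC itself: it fetches the relevant asset versions and computes the diff locally. A rough sketch of that enrichment, where AssetServiceClient and its method are hypothetical rather than a real client API:

```java
// Hypothetical consumer-side enrichment for the "ID + Versions" payload option.
// AssetServiceClient and its method are illustrative, not a real client API.
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class AssetEnricher {

    interface AssetServiceClient {
        Map<String, Object> getAssetMetadata(String assetId, long version);
    }

    private final AssetServiceClient assetService;

    AssetEnricher(AssetServiceClient assetService) {
        this.assetService = assetService;
    }

    /** Fetch both versions (the additional RPCs) and return only the fields that changed. */
    Map<String, Object> diff(String assetId, long previousVersion, long newVersion) {
        Map<String, Object> before = assetService.getAssetMetadata(assetId, previousVersion);
        Map<String, Object> after = assetService.getAssetMetadata(assetId, newVersion);

        Map<String, Object> changed = new HashMap<>();
        for (Map.Entry<String, Object> entry : after.entrySet()) {
            if (!Objects.equals(entry.getValue(), before.get(entry.getKey()))) {
                changed.put(entry.getKey(), entry.getValue());
            }
        }
        return changed;
    }
}
```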
36. Consumer Tips
Process each record polled in parallel to increase throughput (assuming order is not important).
Try to design with idempotency in mind.
Ability to auto-scale: monitor Kafka latency or the number of events produced.
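One common shape for the "process each polled record in parallel" tip, assuming per-record ordering does not matter; the topic name, group id, and thread-pool size are placeholders:

```java
// Sketch of parallel processing of a polled batch when per-record ordering is not required.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "asset-change-consumers");
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        ExecutorService pool = Executors.newFixedThreadPool(16);
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("asset-changes"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                // Fan out the batch, then wait for every record before committing,
                // so offsets are only committed for fully processed batches.
                CompletableFuture<?>[] futures = new CompletableFuture<?>[records.count()];
                int i = 0;
                for (ConsumerRecord<String, String> record : records) {
                    futures[i++] = CompletableFuture.runAsync(() -> process(record), pool);
                }
                CompletableFuture.allOf(futures).join();
                consumer.commitSync();
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        // Idempotent handling goes here: processing the same event twice must be safe.
    }
}
```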
38. Backfill after a new task is introduced
[Diagram: for each asset, extract all versions (1.0, 2.0, 3.0, 4.0) and create events; the new task name is added to the payload; a dedicated backfill consumer processes these events alongside the regular consumers.]
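A rough sketch of that backfill producer: iterate existing assets and their versions, and publish events that carry the newly introduced task name on a dedicated topic so the production flow is untouched. The AssetStore interface and topic name are assumptions.

```java
// Hypothetical backfill producer: replays existing asset versions as events on a dedicated topic.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.List;

class BackfillProducer {

    interface AssetStore {
        List<String> allAssetIds();
        List<Long> versionsOf(String assetId);
    }

    private final KafkaProducer<String, String> producer;
    private final AssetStore assetStore;

    BackfillProducer(KafkaProducer<String, String> producer, AssetStore assetStore) {
        this.producer = producer;
        this.assetStore = assetStore;
    }

    /** Emit one event per asset version, tagged with the task the backfill targets. */
    void backfill(String newTaskName) {
        for (String assetId : assetStore.allAssetIds()) {
            for (long version : assetStore.versionsOf(assetId)) {
                String payload = String.format(
                    "{\"assetId\":\"%s\",\"version\":%d,\"taskName\":\"%s\"}",
                    assetId, version, newTaskName);
                // Dedicated topic keeps the backfill isolated from the production flow.
                producer.send(new ProducerRecord<>("asset-changes-backfill", assetId, payload));
            }
        }
    }
}
```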
39. Key Considerations for Backfills & Migrations
Separate Kafka topic: isolation from the production flow; adjusted retention and partition size.
Separate consumer group: avoids impact on the production flow; scaled independently; can run slower.
Dependent services: identify the limits of dependent services; self-throttle.
Monitoring & alerts: dashboards, Kafka lag, services health, error rates.
Request control: tuneable event production rate; ability to pause / suspend if necessary.
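For the request-control point, one simple approach is to wrap the backfill loop in a rate limiter whose rate and pause switch come from dynamic properties. The sketch below uses Guava's RateLimiter; the property lookups are placeholders for whatever configuration system is in use.

```java
// Sketch of a tuneable, pausable backfill loop using Guava's RateLimiter.
// The dynamic-property lookups are placeholders, not a specific config API.
import com.google.common.util.concurrent.RateLimiter;

class ThrottledBackfill {

    private final RateLimiter rateLimiter = RateLimiter.create(50.0); // events/second, adjustable at runtime

    void run(Iterable<String> eventPayloads) {
        for (String payload : eventPayloads) {
            while (isPaused()) {                   // dynamic "pause" switch for major incidents
                sleepQuietly(1_000);
            }
            rateLimiter.setRate(configuredRate()); // pick up rate changes without a redeploy
            rateLimiter.acquire();                 // self-throttle to protect dependent services
            publish(payload);
        }
    }

    private boolean isPaused() { return false; }      // e.g. backed by a dynamic property
    private double configuredRate() { return 50.0; }  // e.g. backed by a dynamic property
    private void publish(String payload) { /* send to Kafka */ }

    private static void sleepQuietly(long millis) {
        try { Thread.sleep(millis); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```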
43. Observability - Real Time System Insight
Monitoring
Alerting
Discovery
Troubleshooting
Performance Analysis
System Impact
● What?
● When?
● Why?
● How?
45. Events Traceability
[Diagram: the Assets Management Service, Workflow Management Service, Workers, and Events Processor all write to a Centralized Logging Service.]

Id  | Timestamp            | Event Id | Asset Id | Task execution Id
100 | Sep 17, 16:47:40.844 | 95f9..   | bade..   | aedd-4
101 | Sep 17, 16:47:40.851 | 95f9..   | bade..   | aedd-4
102 | Sep 17, 16:47:40.900 | 95f9..   | bade..   | cbaa-2
104 | Sep 17, 16:47:40.950 | 95f9..   | bade..   | cbaa-2
48. Failures, Reprocessing and Dead Letter Queues
Ack only after successful processing of events.
If processing fails, send the failed event to a separate Kafka topic for reprocessing; multiple topics can be used based on priority and/or retry count.
If reprocessing fails, save the event in a Dead Letter Queue: a Kafka topic, or a simple KV or DB table.
Consider adding the ability to pause consumers in case of a major outage, e.g. via dynamic properties with Archaius (github.com/Netflix/archaius).
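A hedged sketch of the retry/DLQ routing described above; the topic names and the retry-count header are placeholders, not the actual implementation:

```java
// Sketch of routing failed events to a retry topic and, after too many attempts, to a DLQ topic.
// Topic names and the retry-count header are illustrative.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.nio.charset.StandardCharsets;

class FailureRouter {

    private static final int MAX_RETRIES = 3;
    private final KafkaProducer<String, String> producer;

    FailureRouter(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    void handle(ConsumerRecord<String, String> record, Runnable processing) {
        try {
            processing.run();
            // Success: the offset is committed ("acked") by the normal consumer loop.
        } catch (Exception e) {
            int retries = retryCount(record);
            String target = retries < MAX_RETRIES ? "asset-changes-retry" : "asset-changes-dlq";
            ProducerRecord<String, String> out =
                new ProducerRecord<>(target, record.key(), record.value());
            out.headers().add("retryCount",
                String.valueOf(retries + 1).getBytes(StandardCharsets.UTF_8));
            producer.send(out);
        }
    }

    private int retryCount(ConsumerRecord<String, String> record) {
        var header = record.headers().lastHeader("retryCount");
        return header == null ? 0 : Integer.parseInt(new String(header.value(), StandardCharsets.UTF_8));
    }
}
```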
49. Failures, Reprocessing and Dead Letter Queues
[Diagram: the Asset Change Consumer, Failure Consumer, and DLQ Consumer read from Kafka; failed events are written to an Events Store (DLQ), which can be queried for failed events; the Asset Mgmt Svc holds asset versions 1, 2, and 3. Failed events can be out of order.]
50. Unit Testing: test the producer and consumer in isolation.
Integration Testing: simulate scenarios that validate the producer, event generation, event payload, and consumers.
Replay Testing: reprocessing of events at runtime.
Load Testing: simulate high-scale events to verify latency and scalability, and understand system limitations.
Idempotency Testing: verify that processing the same event multiple times has the same effect as processing it once.
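As one hedged example of the idempotency test in the last item, a unit test can process the same event twice and assert that the resulting state matches processing it once. The EventHandler and InMemoryStateStore below are hypothetical test doubles, not classes from the deck.

```java
// Illustrative idempotency test: handling the same event twice must leave the same state as handling it once.
import org.junit.jupiter.api.Test;
import java.util.HashMap;
import java.util.Map;
import static org.junit.jupiter.api.Assertions.assertEquals;

class IdempotencyTest {

    /** Minimal stand-in for whatever state a consumer mutates. */
    static class InMemoryStateStore {
        private final Map<String, Long> versions = new HashMap<>();
        void put(String assetId, long version) { versions.put(assetId, version); }
        Map<String, Long> snapshot() { return Map.copyOf(versions); }
    }

    /** Idempotent handler: re-applying the same (assetId, version) pair is a no-op. */
    static class EventHandler {
        private final InMemoryStateStore store;
        EventHandler(InMemoryStateStore store) { this.store = store; }
        void handle(String assetId, long version) { store.put(assetId, version); }
    }

    @Test
    void processingSameEventTwiceHasSameEffectAsOnce() {
        InMemoryStateStore once = new InMemoryStateStore();
        InMemoryStateStore twice = new InMemoryStateStore();

        new EventHandler(once).handle("asset-123", 2);

        EventHandler handler = new EventHandler(twice);
        handler.handle("asset-123", 2);
        handler.handle("asset-123", 2); // duplicate delivery

        assertEquals(once.snapshot(), twice.snapshot());
    }
}
```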