Buckle Up! With Valerie Burchby and Xinran Waibe | Current 2022
Are you considering converting your daily batch ETLs into a new and exhilarating realtime framework? We’ll help you look before you leap as we take a deep dive into the unique operational challenges of transitioning data processing paradigms.
Because batch data pipelines consume data from well-defined time intervals and write results to partitioned data storage, batch jobs are often idempotent, so failure recovery is simply rerunning the faulty job instances. Batch data processes are triggered at a certain frequency (e.g. daily or hourly), so data latency is determined by both the job scheduler and the job run time. Therefore, many advanced data use cases, such as frequency capping, require event streaming to enable real-time data insights. Event streaming applications process unbounded input data in real time and append output to message queues and/or tables to be processed further. However, real-time data insights are no free lunch: event streaming comes with many unique engineering challenges, such as handling late-arriving and duplicate events, implementing event-time partitioning, and backfilling historical data after failures. In addition, batch and event streaming are not incompatible with each other and can often be better together, as shown by the Delta and Kappa architectures commonly adopted in modern data systems.
1. Buckle up!
Field notes for transitioning your daily batch jobs
into realtime architecture.
Current 2022
Valerie Burchby (Game Data Engineering)
Xinran Waibel (Personalization Data Engineering)
2. NETFLIX
CURRENT 2022
About us
Valerie
● Senior Data Engineer @ Netflix
● Domain: Games Data Engineering
Xinran
● Senior Data Engineer @ Netflix
● Domain: Personalization Data Engineering
4. Data at rest vs data in motion
Rethinking data movement for your
● ETLs
● Infrastructure
● Team
An analytics team and platform built for batch will need to make significant investments to interoperate smoothly in a streaming environment.
5. Agenda
❖ Episode 1: Rethinking Data Flow
❖ Episode 2: Data Quality
❖ Episode 3: Output Optimization
❖ Episode 4: Backfill
❖ Season Finale: TL;DR
7. 📦 Data flow in batch processing
Here’s a data flow which probably looks familiar…
[Diagram: Ingress → Data Store (partitions) → Batch App → Data Store (partitions) → Batch App → Output]
8. 🌊 Data flow in stream processing
How hard could this be?
[Diagram: Ingress → Streaming App → Data Store → Batch App → Output]
9. 📦 Data flow in batch processing
Closer look…
Assumptions:
Data completeness can be reasonably inferred by job progression.
Jobs are idempotent, can be safely and deterministically rerun.
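The idempotency assumption above can be illustrated with a minimal Python sketch (tables modeled as plain dicts; all names are invented for this example, not from the talk): a batch job that fully overwrites its output partition can be rerun any number of times without duplicating results.

```python
# Minimal sketch of an idempotent batch job: the output for one hour
# partition is recomputed and fully overwritten, so rerunning the same
# hour yields the same state. Tables here are plain dicts (illustrative).

def run_batch_hour(source: dict, target: dict, hour: str) -> None:
    """Recompute and overwrite one hour partition of the target table."""
    rows = source.get(hour, [])
    # Deterministic transform: aggregate sales per region for this hour.
    agg = {}
    for row in rows:
        agg[row["region"]] = agg.get(row["region"], 0) + row["amount"]
    # Overwriting (not appending) is what makes the rerun safe.
    target[hour] = agg

source = {"21": [{"region": "US", "amount": 5},
                 {"region": "EU", "amount": 3},
                 {"region": "US", "amount": 2}]}
target = {}
run_batch_hour(source, target, "21")
run_batch_hour(source, target, "21")  # rerun: same result, no duplication
```

An append-based job would double-count on the second call; overwrite semantics are what let schedulers blindly retry failed instances.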
[Diagram: same batch data flow as slide 7]
10. 📦 Batch Data Transfer
[Diagram: Source Table (partitioned by event hour 21–00 and sales region) → Target Table (same partitioning). Deterministic batch size and cadence; before each transfer: 👀 dependencies ready? 👀 audits passing? Some partitions pass (✅), others are blocked (⛔).]
12. 🌊 Stream Data Transfer
Streaming = micro-batches of individual records
😵 Job state is fluid and non-deterministic
😵 Lowest latency means least complete
[Diagram: Source Kafka topic with interleaved event times (21, 22, 23, 00) → Target table partitioned by event hour]
13. 🌊 Data flow in stream processing
Common complexity: job signaling
Problems
This streaming app could go down. How will your batch job know?
Cannot hook into batch environment audit frameworks.
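One common mitigation, sketched here as an assumption rather than the talk's actual mechanism, is for the streaming app to publish a progress marker (a high watermark) that the downstream batch job checks before consuming a partition. The marker store below is a plain dict standing in for a real metadata service:

```python
# Hedged sketch: a streaming app periodically records its event-time
# watermark; the downstream batch job treats a partition as ready only
# once the watermark has passed the partition's end time. If the app is
# down and stops reporting, the batch job simply waits.

from datetime import datetime, timezone

def partition_ready(marker_store: dict, app: str,
                    partition_end: datetime) -> bool:
    """True once the named app's watermark has passed partition_end."""
    watermark = marker_store.get(app)
    if watermark is None:          # app never reported: not ready
        return False
    return watermark >= partition_end

markers = {"sessionizer": datetime(2022, 10, 4, 23, 5, tzinfo=timezone.utc)}
hour_22_end = datetime(2022, 10, 4, 23, 0, tzinfo=timezone.utc)
hour_23_end = datetime(2022, 10, 5, 0, 0, tzinfo=timezone.utc)
print(partition_ready(markers, "sessionizer", hour_22_end))  # True
print(partition_ready(markers, "sessionizer", hour_23_end))  # False
```

The app name, marker shape, and readiness rule are all illustrative; the point is that the signal must come from the streaming side, since batch audit frameworks cannot see into it.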
[Diagram: same streaming data flow as slide 8, with the Streaming App needing to signal (📢) progress to the downstream Batch App]
14. 🌊 Data flow in stream processing
Common complexity: creating logical boundaries for stateful apps
Problems
Events must be held in state while waiting for group members.
Longer hold is more complete, but with a latency penalty.
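The hold-vs-latency trade-off can be sketched in a few lines of Python. This is an illustrative toy, not a real streaming framework: sessions are buffered per key and emitted only once the watermark has passed the session end plus an allowed lateness. A larger `allowed_lateness` captures more stragglers but delays every emission.

```python
# Hedged sketch of buffering events in state until a logical boundary
# (session end + allowed lateness) is safely behind the watermark.
# Timestamps are plain ints; names are invented for this example.

def flush_ready(buffers: dict, watermark: int, allowed_lateness: int):
    """Emit sessions whose close time is safely behind the watermark;
    keep the rest held in state."""
    ready, keep = [], {}
    for key, (events, session_end) in buffers.items():
        if watermark >= session_end + allowed_lateness:
            ready.append((key, sorted(events)))
        else:
            keep[key] = (events, session_end)
    buffers.clear()
    buffers.update(keep)
    return ready

# user-a's session closed at t=10; user-b's closes at t=55.
buffers = {"user-a": ([3, 1, 2], 10), "user-b": ([7], 55)}
emitted = flush_ready(buffers, watermark=30, allowed_lateness=15)
# user-a emits (10 + 15 <= 30); user-b stays in state, costing memory.
```

Raising `allowed_lateness` to 25 would have held user-a too: more completeness, more latency, more state.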
[Diagram: same streaming data flow as slide 8; events are held in state and cut (🗡) into logical groups]
15. 🌊 Data flow in stream processing
Common complexity: joining data
Problems
Delay in the streaming app may cause joins to enrich with incorrect data.
[Diagram: same streaming data flow as slide 8, with a second Streaming App feeding the join; timing skew (⏰) between the two streams can enrich events with incorrect data]
16. 🌊 Data flow in stream processing
Common complexity: duplicates
Problems
Restarts and job instability will cause duplicates in the data store.
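One standard defense, sketched below under the assumption that every event carries a unique ID, is keyed deduplication with a TTL so state stays bounded. The dict stands in for keyed state in a real streaming framework:

```python
# Hedged sketch: drop restart-induced duplicates by remembering recently
# seen event IDs, evicting entries older than a TTL. IDs, timestamps, and
# the TTL policy are illustrative assumptions, not the talk's design.

def dedupe(events, ttl_seconds: int = 3600):
    """Drop events whose ID was already seen within the TTL window."""
    seen = {}   # event_id -> timestamp when last seen
    out = []
    for event_id, ts, payload in events:
        # Evict expired entries so state stays bounded.
        seen = {k: v for k, v in seen.items() if ts - v < ttl_seconds}
        if event_id in seen:
            continue               # duplicate from a restart/replay
        seen[event_id] = ts
        out.append((event_id, ts, payload))
    return out

events = [("e1", 0, "a"), ("e2", 5, "b"),
          ("e1", 10, "a"),     # duplicate within the TTL: dropped
          ("e1", 4000, "a")]   # same ID after TTL expiry: kept
unique = dedupe(events, ttl_seconds=3600)
```

The TTL choice mirrors the completeness/latency tension: a longer TTL catches duplicates from older replays at the cost of more state.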
[Diagram: same streaming data flow as slide 8; restarts reintroduce records (♻) into the Data Store]
17. 🌊 Data flow in stream processing
And a more complex one…
To learn more, see: Beyond Daily Batch Processing
[Diagram: a longer pipeline chaining Ingress → Streaming Apps → Data Stores → Batch Apps → Outputs, compounding duplicates (♻), timing issues (⏰), and open questions (❓) at each stage]
19. Assumptions for batch 📦
● Source data is static, materialized, stored in logical groupings
● ETLs can operate on a fixed increment (daily, hourly)
● Processing delays do not materially impact data outcomes
20. Assumptions for streaming 🌊
● Source data is ephemeral and risks becoming lossy
● Streaming applications are constantly consuming and producing data, requiring attention to “up” time
● Delays to processing can cause non-deterministic results
22. Batch Data Quality 📦
Write-Audit-Publish (WAP)
● Write batch job output to a temporary location
● Audit the output by querying it and comparing it with historical data
● Publish the data if audits pass
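The WAP steps above can be sketched in Python. This is a toy under stated assumptions (tables as dicts, a row-count audit against a historical average, an invented tolerance), not the actual Netflix implementation:

```python
# Hedged sketch of Write-Audit-Publish: staged output is promoted to the
# production table only if an audit passes; a failed audit leaves prod
# untouched. The audit here compares row counts to the historical mean.

def write_audit_publish(staging: list, prod: dict, history: list,
                        partition: str, tolerance: float = 0.5) -> bool:
    """Publish the staged partition only if its row count is within
    `tolerance` (fractional deviation) of the historical average."""
    if history:
        avg = sum(history) / len(history)
        if avg and abs(len(staging) - avg) / avg > tolerance:
            return False           # audit failed: do not publish
    prod[partition] = list(staging)  # the "publish" step
    return True

prod = {}
ok = write_audit_publish([1, 2, 3, 4], prod,
                         history=[4, 5, 4], partition="2022-10-04")
bad = write_audit_publish([1], prod,
                          history=[4, 5, 4], partition="2022-10-05")
# ok publishes; bad fails the audit and leaves prod unchanged.
```

Real audits would also check nulls, key uniqueness, and schema; the pattern is the same: write, check, then atomically publish.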
23. Stream Data Quality 🌊
Alerting on real-time metrics:
● Job health:
○ Restarts
○ Checkpointing
○ Consumer lag
○ Watermark
● Custom app metrics:
○ Filtered in vs. out
○ Parsing failures
○ Invalid field values
○ Output volume
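As an illustration of alerting on one of these job-health metrics, here is a hedged sketch of a consumer-lag check. The threshold, sample cadence, and sustained-window rule are assumptions for the example, not production values:

```python
# Hedged sketch: alert on Kafka consumer lag only when it stays above a
# threshold for several consecutive samples, so transient spikes (e.g.
# during a deploy) do not page anyone.

def should_alert(lag_samples, threshold: int, sustained: int) -> bool:
    """True iff the last `sustained` lag samples all exceed threshold."""
    if len(lag_samples) < sustained:
        return False
    return all(lag > threshold for lag in lag_samples[-sustained:])

# One spike recovers on its own: no alert.
print(should_alert([100, 90_000, 500], threshold=50_000, sustained=3))
# Lag sustained across three samples: alert.
print(should_alert([60_000, 70_000, 80_000], threshold=50_000, sustained=3))
```

The same shape works for restart counts, checkpoint failures, watermark staleness, or custom metrics like parse-failure rate.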
24. Stream data anomaly
How to cure unhealthy jobs…
💊 (Auto) manual redeployment → solves 80% of problems
💊 Cluster resource tuning, e.g. for out-of-memory or disk-space issues
💊 The worst case requires a hotfix
25. Stream data anomaly (cont.)
How to handle data inaccuracy:
🛠 Avoid full job failure and log info for debugging
🛠 For problematic events:
○ Skip, evict, or reprocess
○ Tip: Consider the nature of the data and use cases.
30. Batch consumers of streaming systems
[Diagram: Real-time Source → Streaming App → output to both a Kafka Topic (✅ for streaming consumers) and a Data Lake (⚠ for batch consumers). 😰 But batch output is often partitioned by processing time.]
31. Batch consumers of streaming systems
[Diagram: same as slide 30]
To optimize read performance for batch consumers, we need to adopt a meaningful partitioning strategy for the batch output.
32. Event-time partitioning for batch output
Challenge: Late-arriving records are small in volume but spread across numerous event-time partitions, leading to many small files and a large amount of memory held open for file writing.
Solution: Add a batching mechanism that holds late records and flushes them only periodically, writing bigger files to older partitions (without compromising the latency of recent events).
To learn more, see: Streaming Event-Time Partitioning With Flink and Iceberg
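The batching mechanism can be sketched as follows. This is a simplified assumption-laden model (integer event hours, counts instead of files, an invented lateness cutoff), not the Flink/Iceberg implementation from the linked post:

```python
# Hedged sketch: recent events are written through immediately, while late
# events are buffered per old partition and flushed only periodically, so
# older partitions receive fewer, larger writes.

def route_event(event_hour: int, current_hour: int, late_buffer: dict,
                writes: list, lateness_cutoff: int = 1) -> None:
    """Write recent events immediately; buffer late ones by partition."""
    if current_hour - event_hour <= lateness_cutoff:
        writes.append((event_hour, 1))      # hot path: write now
    else:
        late_buffer[event_hour] = late_buffer.get(event_hour, 0) + 1

def flush_late(late_buffer: dict, writes: list) -> None:
    """Periodic flush: one bigger write per old partition."""
    for hour, count in sorted(late_buffer.items()):
        writes.append((hour, count))
    late_buffer.clear()

writes, late = [], {}
for ev_hour in [23, 23, 21, 22, 21, 23]:    # "now" is hour 23
    route_event(ev_hour, current_hour=23, late_buffer=late, writes=writes)
flush_late(late, writes)
# Hours 23 and 22 were written immediately; the two hour-21 stragglers
# became a single buffered write of size 2.
```

In the real system the flush interval bounds both file count and the memory held for open writers, which is exactly the trade the slide describes.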
34. Why backfill?
Data applications can fail and produce incorrect output for many reasons:
● Unexpected input data changes
● Dependent system outage
● Source/sink failures
After failures, we need to backfill to mitigate downstream impact.
Plus, backfilling is often required when building new datasets or bootstrapping streaming apps.
37. Option #1: Replaying source events
The easiest way to backfill is to re-run the streaming job, reprocessing source events from the problematic period.
Problems
😭 Troubleshooting can take hours or days, and source data can expire.
😭 Increasing Kafka retention is very expensive.
• Tiered storage could help reduce the cost
[Diagram: Real-time Source → Streaming App → Output]
38. Option #2: Lambda Architecture
Build a batch application (e.g. a Spark job) that is equivalent to the streaming application but reads from the data lake.
Problems
😵 Maintaining two applications in parallel demands significant engineering effort.
[Diagram: Real-time Source → Streaming App (Prod) → Output; Data Lake Source → Batch App (Backfill) → Output]
39. Option #3: Unified batch and streaming
[Diagram: Real-time Source → Unified App (Streaming Mode) → Output; Data Lake Source → Unified App (Batch Mode) → Output]
Data processing frameworks such as Apache Flink and Beam offer both batch and streaming modes.
Problems
😭 Flink requires significant code changes to run in batch mode.
😭 Beam has only partial support for state, timers, and watermarks.
40. Option #4: Kappa Architecture (Netflix’s Choice)
[Diagram: Real-time Source → Streaming App (Prod Stack); Data Lake Source → Streaming App (Backfill Stack); output lands in both a Kafka Topic and a Data Lake]
The same streaming application streams from Kafka sources in production but reads from the data lake for backfill.
To learn more, see: Backfill Streaming Data Pipelines in Kappa Architecture
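The core Kappa idea, one application body with a swappable source, can be sketched in Python. The source abstraction and all names here are illustrative assumptions, not Netflix's actual API:

```python
# Hedged sketch: the same transform logic runs in a "prod stack" reading
# from a Kafka-like stream and a "backfill stack" reading from a data
# lake; only the source is swapped, so the two stacks cannot drift apart
# the way Lambda's parallel codebases can.

from typing import Callable, Iterable

def make_app(source: Callable[[], Iterable[dict]],
             transform: Callable[[dict], int]) -> Callable[[], list]:
    """Bind one shared application body to a particular source."""
    def run():
        return [transform(e) for e in source()]
    return run

def kafka_source():        # prod: unbounded stream (finite here for demo)
    yield from [{"v": 1}, {"v": 2}]

def data_lake_source():    # backfill: bounded historical read
    yield from [{"v": 1}, {"v": 2}, {"v": 3}]

transform = lambda e: e["v"] * 10
prod_run = make_app(kafka_source, transform)
backfill_run = make_app(data_lake_source, transform)
```

Because `transform` is shared, a backfill produces exactly what production would have produced over the same events, which is the property that makes this architecture attractive for failure recovery.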
42. Batch → Streaming: TL;DR
(aka “Oops, I fell asleep…”)
💬 Rethink data and processing
💬 Lost gifts from batch
💬 Completeness vs. latency
💬 Be ready for failures and recovery
💬 Ops burden is high (tooling is new)