Buckle Up! With Valerie Burchby and Xinran Waibe | Current 2022
Are you considering converting your daily batch ETLs into a new and exhilarating realtime framework? We’ll help you look before you leap as we take a deep dive into the unique operational challenges of transitioning data processing paradigms.
Because batch data pipelines consume data from well-defined time intervals and write results to partitioned data storage, batch jobs are often idempotent, so failure recovery is simply rerunning the faulty job instances. Batch data processes are triggered at a certain frequency (e.g. daily or hourly), so data latency is determined by both the job scheduler and the job run time. Therefore, many advanced data use cases, such as frequency capping, require event streaming to enable real-time data insights. Event streaming applications process unbounded input data in real time and append output to message queues and/or tables to be processed further. However, real-time data insights are no free lunch: event streaming comes with many unique engineering challenges, such as handling late-arriving and duplicate events, implementing event-time partitioning, and backfilling historical data after failures. In addition, batch and event streaming are not incompatible with each other and can often be better together, as shown by the Delta and Kappa architectures commonly adopted in modern data systems.
1. Buckle up!
Field notes for transitioning your daily batch jobs
into realtime architecture.
Current 2022
Valerie Burchby (Game Data Engineering)
Xinran Waibel (Personalization Data Engineering)
2. NETFLIX
CURRENT 2022
About us
Valerie
● Senior Data Engineer @ Netflix
● Domain: Games Data Engineering
Xinran
● Senior Data Engineer @ Netflix
● Domain: Personalization Data Engineering
4. Data at rest vs data in motion
Rethinking data movement for your
● ETLs
● Infrastructure
● Team
An analytics team and platform built for batch will need to make significant investments to interoperate smoothly in a streaming environment.
5. Agenda
❖ Episode 1: Rethinking Data Flow
❖ Episode 2: Data Quality
❖ Episode 3: Output Optimization
❖ Episode 4: Backfill
❖ Season Finale: TL;DR
7. 📦 Data flow in batch processing
Here’s a data flow which probably looks familiar…
[Diagram: Ingress → Data Store (partitions) → Batch App → Data Store (partitions) → Batch App → Output]
8. 🌊 Data flow in stream processing
How hard could this be?
[Diagram: Ingress → Streaming App → Data Store → Batch App → Output]
9. 📦 Data flow in batch processing
Closer look…
Assumptions:
Data completeness can be reasonably inferred by job progression.
Jobs are idempotent, can be safely and deterministically rerun.
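The idempotency assumption above can be illustrated with a minimal Python sketch (tables modeled as plain dicts; all names are invented for this example, not from the talk): a batch job that fully overwrites its output partition can be rerun any number of times without duplicating results.

```python
# Minimal sketch of an idempotent batch job: the output for one hour
# partition is recomputed and fully overwritten, so rerunning the same
# hour yields the same state. Tables here are plain dicts (illustrative).

def run_batch_hour(source: dict, target: dict, hour: str) -> None:
    """Recompute and overwrite one hour partition of the target table."""
    rows = source.get(hour, [])
    # Deterministic transform: aggregate sales per region for this hour.
    agg = {}
    for row in rows:
        agg[row["region"]] = agg.get(row["region"], 0) + row["amount"]
    # Overwriting (not appending) is what makes the rerun safe.
    target[hour] = agg

source = {"21": [{"region": "US", "amount": 5},
                 {"region": "EU", "amount": 3},
                 {"region": "US", "amount": 2}]}
target = {}
run_batch_hour(source, target, "21")
run_batch_hour(source, target, "21")  # rerun: same result, no duplication
```

An append-based job would double-count on the second call; overwrite semantics are what let schedulers blindly retry failed instances.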
[Diagram: same batch data flow as slide 7]
10. 📦 Batch Data Transfer
[Diagram: Source Table (partitioned by event hour 21–00 and sales region) → Target Table (same partitioning). Deterministic batch size and cadence; before each transfer: 👀 dependencies ready? 👀 audits passing? Some partitions pass (✅), others are blocked (⛔).]
12. 🌊 Stream Data Transfer
Streaming = micro-batches of individual records
😵 Job state is fluid and non-deterministic
😵 Lowest latency means least complete
[Diagram: Source Kafka topic with interleaved event times (21, 22, 23, 00) → Target table partitioned by event hour]
13. 🌊 Data flow in stream processing
Common complexity: job signaling
Problems
This streaming app could go down. How will your batch job know?
Cannot hook into batch environment audit frameworks.
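One common mitigation, sketched here as an assumption rather than the talk's actual mechanism, is for the streaming app to publish a progress marker (a high watermark) that the downstream batch job checks before consuming a partition. The marker store below is a plain dict standing in for a real metadata service:

```python
# Hedged sketch: a streaming app periodically records its event-time
# watermark; the downstream batch job treats a partition as ready only
# once the watermark has passed the partition's end time. If the app is
# down and stops reporting, the batch job simply waits.

from datetime import datetime, timezone

def partition_ready(marker_store: dict, app: str,
                    partition_end: datetime) -> bool:
    """True once the named app's watermark has passed partition_end."""
    watermark = marker_store.get(app)
    if watermark is None:          # app never reported: not ready
        return False
    return watermark >= partition_end

markers = {"sessionizer": datetime(2022, 10, 4, 23, 5, tzinfo=timezone.utc)}
hour_22_end = datetime(2022, 10, 4, 23, 0, tzinfo=timezone.utc)
hour_23_end = datetime(2022, 10, 5, 0, 0, tzinfo=timezone.utc)
print(partition_ready(markers, "sessionizer", hour_22_end))  # True
print(partition_ready(markers, "sessionizer", hour_23_end))  # False
```

The app name, marker shape, and readiness rule are all illustrative; the point is that the signal must come from the streaming side, since batch audit frameworks cannot see into it.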
[Diagram: same streaming data flow as slide 8, with the Streaming App needing to signal (📢) progress to the downstream Batch App]
14. 🌊 Data flow in stream processing
Common complexity: creating logical boundaries for stateful apps
Problems
Events must be held in state while waiting for group members.
Longer hold is more complete, but with a latency penalty.
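The hold-vs-latency trade-off can be sketched in a few lines of Python. This is an illustrative toy, not a real streaming framework: sessions are buffered per key and emitted only once the watermark has passed the session end plus an allowed lateness. A larger `allowed_lateness` captures more stragglers but delays every emission.

```python
# Hedged sketch of buffering events in state until a logical boundary
# (session end + allowed lateness) is safely behind the watermark.
# Timestamps are plain ints; names are invented for this example.

def flush_ready(buffers: dict, watermark: int, allowed_lateness: int):
    """Emit sessions whose close time is safely behind the watermark;
    keep the rest held in state."""
    ready, keep = [], {}
    for key, (events, session_end) in buffers.items():
        if watermark >= session_end + allowed_lateness:
            ready.append((key, sorted(events)))
        else:
            keep[key] = (events, session_end)
    buffers.clear()
    buffers.update(keep)
    return ready

# user-a's session closed at t=10; user-b's closes at t=55.
buffers = {"user-a": ([3, 1, 2], 10), "user-b": ([7], 55)}
emitted = flush_ready(buffers, watermark=30, allowed_lateness=15)
# user-a emits (10 + 15 <= 30); user-b stays in state, costing memory.
```

Raising `allowed_lateness` to 25 would have held user-a too: more completeness, more latency, more state.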
[Diagram: same streaming data flow as slide 8; events are held in state and cut (🗡) into logical groups]
15. 🌊 Data flow in stream processing
Common complexity: joining data
Problems
Delay in the streaming app may cause joins to enrich with incorrect data.
[Diagram: same streaming data flow as slide 8, with a second Streaming App feeding the join; timing skew (⏰) between the two streams can enrich events with incorrect data]
16. 🌊 Data flow in stream processing
Common complexity: duplicates
Problems
Restarts and job instability will cause duplicates in the data store.
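One standard defense, sketched below under the assumption that every event carries a unique ID, is keyed deduplication with a TTL so state stays bounded. The dict stands in for keyed state in a real streaming framework:

```python
# Hedged sketch: drop restart-induced duplicates by remembering recently
# seen event IDs, evicting entries older than a TTL. IDs, timestamps, and
# the TTL policy are illustrative assumptions, not the talk's design.

def dedupe(events, ttl_seconds: int = 3600):
    """Drop events whose ID was already seen within the TTL window."""
    seen = {}   # event_id -> timestamp when last seen
    out = []
    for event_id, ts, payload in events:
        # Evict expired entries so state stays bounded.
        seen = {k: v for k, v in seen.items() if ts - v < ttl_seconds}
        if event_id in seen:
            continue               # duplicate from a restart/replay
        seen[event_id] = ts
        out.append((event_id, ts, payload))
    return out

events = [("e1", 0, "a"), ("e2", 5, "b"),
          ("e1", 10, "a"),     # duplicate within the TTL: dropped
          ("e1", 4000, "a")]   # same ID after TTL expiry: kept
unique = dedupe(events, ttl_seconds=3600)
```

The TTL choice mirrors the completeness/latency tension: a longer TTL catches duplicates from older replays at the cost of more state.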
[Diagram: same streaming data flow as slide 8; restarts reintroduce records (♻) into the Data Store]
17. 🌊 Data flow in stream processing
And a more complex one…
To learn more, see: Beyond Daily Batch Processing
[Diagram: a longer pipeline chaining Ingress → Streaming Apps → Data Stores → Batch Apps → Outputs, compounding duplicates (♻), timing issues (⏰), and open questions (❓) at each stage]
19. Assumptions for batch 📦
● Source data is static, materialized, stored in logical groupings
● ETLs can operate on a fixed increment (daily, hourly)
● Processing delays do not materially impact data outcomes
20. Assumptions for streaming 🌊
● Source data is ephemeral and risks becoming lossy
● Streaming applications are constantly consuming and producing data, requiring attention to “up” time
● Delays to processing can cause non-deterministic results
22. Batch Data Quality 📦
Write-Audit-Publish (WAP)
● Write batch job output to a temporary location
● Audit the output by querying it and comparing it with historical data
● Publish the data if audits pass
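The WAP steps above can be sketched in Python. This is a toy under stated assumptions (tables as dicts, a row-count audit against a historical average, an invented tolerance), not the actual Netflix implementation:

```python
# Hedged sketch of Write-Audit-Publish: staged output is promoted to the
# production table only if an audit passes; a failed audit leaves prod
# untouched. The audit here compares row counts to the historical mean.

def write_audit_publish(staging: list, prod: dict, history: list,
                        partition: str, tolerance: float = 0.5) -> bool:
    """Publish the staged partition only if its row count is within
    `tolerance` (fractional deviation) of the historical average."""
    if history:
        avg = sum(history) / len(history)
        if avg and abs(len(staging) - avg) / avg > tolerance:
            return False           # audit failed: do not publish
    prod[partition] = list(staging)  # the "publish" step
    return True

prod = {}
ok = write_audit_publish([1, 2, 3, 4], prod,
                         history=[4, 5, 4], partition="2022-10-04")
bad = write_audit_publish([1], prod,
                          history=[4, 5, 4], partition="2022-10-05")
# ok publishes; bad fails the audit and leaves prod unchanged.
```

Real audits would also check nulls, key uniqueness, and schema; the pattern is the same: write, check, then atomically publish.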
23. Stream Data Quality 🌊
Alerting on real-time metrics:
● Job health:
○ Restarts
○ Checkpointing
○ Consumer lag
○ Watermark
● Custom app metrics:
○ Filtered in vs. out
○ Parsing failures
○ Invalid field values
○ Output volume
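As an illustration of alerting on one of these job-health metrics, here is a hedged sketch of a consumer-lag check. The threshold, sample cadence, and sustained-window rule are assumptions for the example, not production values:

```python
# Hedged sketch: alert on Kafka consumer lag only when it stays above a
# threshold for several consecutive samples, so transient spikes (e.g.
# during a deploy) do not page anyone.

def should_alert(lag_samples, threshold: int, sustained: int) -> bool:
    """True iff the last `sustained` lag samples all exceed threshold."""
    if len(lag_samples) < sustained:
        return False
    return all(lag > threshold for lag in lag_samples[-sustained:])

# One spike recovers on its own: no alert.
print(should_alert([100, 90_000, 500], threshold=50_000, sustained=3))
# Lag sustained across three samples: alert.
print(should_alert([60_000, 70_000, 80_000], threshold=50_000, sustained=3))
```

The same shape works for restart counts, checkpoint failures, watermark staleness, or custom metrics like parse-failure rate.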
24. Stream data anomaly
How to cure unhealthy jobs…
💊 (Auto) manual redeployment → solves 80% of problems
💊 Cluster resource tuning, e.g. for out-of-memory or disk-space issues
💊 The worst case requires a hotfix
25. Stream data anomaly (cont.)
How to handle data inaccuracy:
🛠 Avoid full job failure and log info for debugging
🛠 For problematic events:
○ Skip, evict, or reprocess
○ Tip: Consider the nature of the data and use cases.
30. Batch consumers of streaming systems
[Diagram: Real-time Source → Streaming App → output to both a Kafka Topic (✅ for streaming consumers) and a Data Lake (⚠ for batch consumers). 😰 But batch output is often partitioned by processing time.]
31. Batch consumers of streaming systems
[Diagram: same as slide 30]
To optimize read performance for batch consumers, we need to adopt a meaningful partitioning strategy for the batch output.
32. Event-time partitioning for batch output
Challenge: Late-arriving records are small in volume but spread across numerous event-time partitions, leading to many small files and a large amount of memory held open for file writing.
Solution: Add a batching mechanism that holds late records and flushes them only periodically, writing bigger files to older partitions (without compromising the latency of recent events).
To learn more, see: Streaming Event-Time Partitioning With Flink and Iceberg
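The batching mechanism can be sketched as follows. This is a simplified assumption-laden model (integer event hours, counts instead of files, an invented lateness cutoff), not the Flink/Iceberg implementation from the linked post:

```python
# Hedged sketch: recent events are written through immediately, while late
# events are buffered per old partition and flushed only periodically, so
# older partitions receive fewer, larger writes.

def route_event(event_hour: int, current_hour: int, late_buffer: dict,
                writes: list, lateness_cutoff: int = 1) -> None:
    """Write recent events immediately; buffer late ones by partition."""
    if current_hour - event_hour <= lateness_cutoff:
        writes.append((event_hour, 1))      # hot path: write now
    else:
        late_buffer[event_hour] = late_buffer.get(event_hour, 0) + 1

def flush_late(late_buffer: dict, writes: list) -> None:
    """Periodic flush: one bigger write per old partition."""
    for hour, count in sorted(late_buffer.items()):
        writes.append((hour, count))
    late_buffer.clear()

writes, late = [], {}
for ev_hour in [23, 23, 21, 22, 21, 23]:    # "now" is hour 23
    route_event(ev_hour, current_hour=23, late_buffer=late, writes=writes)
flush_late(late, writes)
# Hours 23 and 22 were written immediately; the two hour-21 stragglers
# became a single buffered write of size 2.
```

In the real system the flush interval bounds both file count and the memory held for open writers, which is exactly the trade the slide describes.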
34. Why backfill?
Data applications can fail and produce incorrect output for many reasons:
● Unexpected input data changes
● Dependent system outage
● Source/sink failures
After failures, we need to backfill to mitigate downstream impact.
Plus, backfilling is often required when building new datasets or bootstrapping streaming apps.
37. Option #1: Replaying source events
The easiest way to backfill is to re-run the streaming job, reprocessing source events from the problematic period.
Problems
😭 Troubleshooting can take hours or days, and source data can expire.
😭 Increasing Kafka retention is very expensive.
• Tiered storage could help reduce the cost
[Diagram: Real-time Source → Streaming App → Output]
38. Option #2: Lambda Architecture
Build a batch application (e.g. a Spark job) that is equivalent to the streaming application but reads from the data lake.
Problems
😵 Maintaining two applications in parallel demands significant engineering effort.
[Diagram: Real-time Source → Streaming App (Prod) → Output; Data Lake Source → Batch App (Backfill) → Output]
39. Option #3: Unified batch and streaming
[Diagram: Real-time Source → Unified App (Streaming Mode) → Output; Data Lake Source → Unified App (Batch Mode) → Output]
Data processing frameworks such as Apache Flink and Beam offer both batch and streaming modes.
Problems
😭 Flink requires significant code changes to run in batch mode.
😭 Beam has only partial support for state, timers, and watermarks.
40. Option #4: Kappa Architecture (Netflix’s Choice)
[Diagram: Real-time Source → Streaming App (Prod Stack); Data Lake Source → Streaming App (Backfill Stack); output lands in both a Kafka Topic and a Data Lake]
The same streaming application streams from Kafka sources in production but reads from the data lake for backfill.
To learn more, see: Backfill Streaming Data Pipelines in Kappa Architecture
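The core Kappa idea, one application body with a swappable source, can be sketched in Python. The source abstraction and all names here are illustrative assumptions, not Netflix's actual API:

```python
# Hedged sketch: the same transform logic runs in a "prod stack" reading
# from a Kafka-like stream and a "backfill stack" reading from a data
# lake; only the source is swapped, so the two stacks cannot drift apart
# the way Lambda's parallel codebases can.

from typing import Callable, Iterable

def make_app(source: Callable[[], Iterable[dict]],
             transform: Callable[[dict], int]) -> Callable[[], list]:
    """Bind one shared application body to a particular source."""
    def run():
        return [transform(e) for e in source()]
    return run

def kafka_source():        # prod: unbounded stream (finite here for demo)
    yield from [{"v": 1}, {"v": 2}]

def data_lake_source():    # backfill: bounded historical read
    yield from [{"v": 1}, {"v": 2}, {"v": 3}]

transform = lambda e: e["v"] * 10
prod_run = make_app(kafka_source, transform)
backfill_run = make_app(data_lake_source, transform)
```

Because `transform` is shared, a backfill produces exactly what production would have produced over the same events, which is the property that makes this architecture attractive for failure recovery.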
42. Batch → Streaming: TL;DR
(aka “Oops, I fell asleep…”)
💬 Rethink data and processing
💬 Lost gifts from batch
💬 Completeness vs. latency
💬 Be ready for failures and recovery
💬 Ops burden is high (tooling is new)