Monal Daxini presented on Netflix's use of Apache Flink for stream processing. Netflix introduced Flink two years ago and has driven its adoption within the company. Key figures: around 2,000 routing jobs processing roughly 3 trillion events per day across approximately 10,000 containers.
Flink at Netflix (PayPal Speaker Series)
1. Stream Processing with Flink at Netflix
Monal Daxini
Stream Processing Lead
7/5/2018
@monaldax
2. • Worked on the stream processing platform for business analytics for 4 years
• Including vision, roadmap, and implementation
• Introduced Flink to Netflix 2 years ago, and drove adoption
Profile
3. @monaldax
How do we leverage Flink? (Currently using Flink 1.4)
• Keystone Pipeline (point & click): routing, filtering, projection; at-least-once
• Streaming Jobs: analytics, applications, platforms
• Streaming SQL (future)
32. @monaldax
Router Flink Job In HA Mode With 1 Job Manager
Isolation: one Flink cluster runs one routing job
Diagram: Zookeeper, one leader Job Manager (WebUI), three Task Managers
One dedicated Zookeeper cluster for all streaming jobs
33. @monaldax
Flink Job Cluster In HA Mode With 2 Job Managers
Diagram: Zookeeper, leader Job Manager (WebUI) plus standby Job Manager (WebUI), three Task Managers
One dedicated Zookeeper cluster for all streaming jobs
36. Broadly, Two Categories Of Streaming Jobs
• Stateless
• No state maintained across events (except Kafka offsets)
• Stateful
• State maintained across events
@monaldax
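The distinction can be illustrated outside Flink with a toy sketch (plain Java; the class and method names are hypothetical, not Flink APIs): a stateless transform needs only the current event, while a stateful one must keep something, here a per-key count, across events.

```java
import java.util.HashMap;
import java.util.Map;

public class StatelessVsStateful {

    // Stateless: output depends only on the current event
    // (in Flink terms, think of a simple MapFunction).
    static String project(String event) {
        return event.toUpperCase();
    }

    // Stateful: a per-key counter survives across events
    // (in Flink terms, think of keyed state in a process function).
    static final Map<String, Integer> counts = new HashMap<>();

    static int countFor(String key) {
        return counts.merge(key, 1, Integer::sum);
    }

    public static void main(String[] args) {
        System.out.println(project("play"));    // PLAY - no memory of prior events
        countFor("user-1");
        System.out.println(countFor("user-1")); // 2 - remembers the first event
    }
}
```

In Flink the stateful case is what checkpoints and savepoints must capture; the stateless case only needs the Kafka offsets mentioned above.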
37. Image adapted from: Stephan Ewen
Stateless Stream Processor – No Internal State
@monaldax
40. @monaldax
Stateless Streaming Job Use Case: High-Level Architecture
Enriching And Identifying Certain Plays
Diagram: play logs feed the streaming job, which enriches them with lookup data from live services (Playback History Service, Video Metadata)
42. Search Sessionization – Custom Windowing On Out-of-order Events
Diagram: sessions assembled from out-of-order start (S) and end (E) events over a span of hours; Session 1 and Session 2 overlap in arrival order
@monaldax
43. Stateful Streaming Application With Local State, Checkpoints, And Savepoints
Diagram: sources feed the streaming application (Flink engine with local state), which writes to sinks; external checkpoints are taken automatically, savepoints are explicitly triggered
@monaldax
44. Checkpoint Ack With Metadata
Diagram: checkpoint barriers flow through the job; each task snapshots its state to S3 and sends an ACK with metadata to the Job Manager
47. State
● State sharding - Key Groups
○ Cannot change once set for a job
○ Total scale up limited by # of Key Groups
○ Lightweight - can have 1000s of Key Groups
● Scale up only from a savepoint, not from a checkpoint
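The scale-up cap above follows from the key-group math itself. A simplified sketch in plain Java (Flink's actual `KeyGroupRangeAssignment` additionally murmur-hashes the key's hash code before the modulo; the subtask formula below matches its `computeOperatorIndexForKeyGroup`):

```java
public class KeyGroupDemo {

    // Simplified key -> key-group assignment; Flink also applies a murmur
    // hash to key.hashCode() before taking the modulo.
    static int keyGroupFor(Object key, int maxParallelism) {
        return Math.abs(key.hashCode() % maxParallelism);
    }

    // Which parallel subtask owns a given key group.
    static int subtaskFor(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 4; // number of key groups, fixed for the job's lifetime

        // Scaling beyond the key-group count leaves subtasks idle: with
        // 8 subtasks but only 4 key groups, only 4 subtasks receive state.
        java.util.Set<Integer> busy = new java.util.TreeSet<>();
        for (int kg = 0; kg < maxParallelism; kg++) {
            busy.add(subtaskFor(kg, maxParallelism, 8));
        }
        System.out.println("busy subtasks: " + busy); // [0, 2, 4, 6]
    }
}
```

This is why key groups can be generous (thousands are cheap) but can never be changed after the job first runs.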
51. ● Easy-to-use dataflow programming model (VLDB, 2015)
○ Tools for reasoning about time - Event-time, watermarks, etc.
● Support for all messaging semantics including Exactly-once
● Exactly-once (semantics) fault tolerant state management (large state)
○ Enables Kappa Architecture
● Multiple connectors - sources and sinks
Why Flink?
52. ● Multi stage jobs with exactly once processing semantics
○ Without the need for an external queue
○ Parallelism can be set independently for each operator in the DAG
● Backpressure support
● Support for Scala & Java
● Open source and active Community
Why Flink?
54. Scaling S3 Based Flink Checkpoint Store
● S3 automatically shards buckets internally with steady growth in RPS
● Lack of entropy in prefix path hinders scalability
@monaldax
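One mitigation, added to Flink's S3 filesystems after the 1.4 series used here, is entropy injection: a marker in the checkpoint path is replaced with random characters so checkpoint objects spread across S3 key-space partitions. A sketch of the relevant `flink-conf.yaml` settings (option names assume the bundled flink-s3-fs plugins; bucket name is illustrative):

```yaml
# Replace the _entropy_ marker in checkpoint paths with 4 random
# characters, spreading checkpoint objects across S3 partitions.
s3.entropy.key: _entropy_
s3.entropy.length: 4
state.checkpoints.dir: s3://my-checkpoint-bucket/_entropy_/checkpoints
```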
57. Reducing Snapshots Uploaded To S3
For stateless jobs, or jobs with very small state
● Avoid S3 writes from Task Managers; aggregate state in the Job Manager
state.backend.fs.memory-threshold = 1048576
58. Checkpoint Metadata File After All Acks
Diagram: 1. Task Managers snapshot data to Amazon S3 and ACK the Job Manager; 2. after all ACKs, the Job Manager writes the checkpoint data & metadata file to S3
@monaldax
59. High Frequency Checkpoints
● For checkpointing intervals under 5 seconds and large Flink clusters,
consider a store other than S3.
60. Too Many HEAD Requests, Despite Zero S3 Writes
• Create CheckpointStreamFactory only once during operator initialization
(FLINK-5800)
• Fixed since 1.3
67. Current implementation issue (FLINK-8042)
● Reverts to a full restart immediately if the replacement container doesn't come
back in time (FLINK-8042)
● Issue will be addressed in FLIP-6
70. Challenges Of Stateful Job With Large State
1. Cannot avoid S3 writes from task managers
a. Each task manager has large state
2. Fine grained recovery (+1 standby) does not help
a. Connected job graph
3. Long recovery times
71. Stateful Jobs Often Have Data Shuffle Operation
Diagram: source → keyBy → window → sink, with parallel subtasks A1–A3, B1–B3, C1–C3; the keyBy shuffles data across subtasks, and window state lives on local HDDs checkpointed to S3
80. Flink Job Requires Reprocessing
Option 1: Rewind To A Checkpoint In The Past With Kafka Source
Timeline: checkpoint y … checkpoint x, outage period, checkpoint x+1, now
82. Option 2: Reprocess From Backfill Source In Parallel
Timeline: a backfill source replays the outage period in parallel with the live source, up to now
83. Stateless Job: Works! Stateful Job Challenge: Partially
Ordered But Bounded Source, And State Warm-up
Timeline: after the outage period comes a warm-up period with no output, then the job emits output again; backfill events are partially ordered but bounded
84. Complex Case of Reprocessing Events
What should Job 2 do?
Timeline: outage period between 1pm and 4pm
85. For Reprocessing, We Are Leaning Towards Option 1
● Messaging system with Tiered Storage
○ Kafka holds recent data (past 24 hours), the rest flows to a
secondary storage like S3. Producers and Consumers would
be able to use this seamlessly, OR
○ Use another product that fulfils this need.
86. Not a lambda architecture
● Single streaming code base
● Just switch source from Kafka to Hive
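The "just switch source" idea can be sketched with a tiny source abstraction (plain Java; `EventSource`, `kafkaLike`, and `hiveLike` are hypothetical names, not Flink APIs): the processing logic is written once against the interface, and live vs. backfill runs differ only in which source is plugged in.

```java
import java.util.List;

public class SingleCodebase {

    // Hypothetical source abstraction: Kafka-backed for live traffic,
    // Hive-backed for reprocessing older data.
    interface EventSource {
        List<String> read();
    }

    static EventSource kafkaLike() { return () -> List.of("a", "b"); }
    static EventSource hiveLike()  { return () -> List.of("a", "b"); }

    // The single streaming code path, written once against the interface.
    static long process(EventSource source) {
        return source.read().stream().filter(e -> !e.isEmpty()).count();
    }

    public static void main(String[] args) {
        // Same code base, same result - only the plugged-in source differs.
        System.out.println(process(kafkaLike()) == process(hiveLike())); // true
    }
}
```

Because both paths share one `process` implementation, there is no second batch code base to keep in sync, which is the contrast with a lambda architecture drawn above.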
87. Misc
• Hive Side Input for small reference tables
• Structured Data Infrastructure (In Progress…)
• Schema registration, validation and discovery infrastructure
• Notifications of end-to-end breaking schema changes
• Providing Schematized Keystone Pipeline Streams
• Managed State for Materialized shuffle between two jobs
89. Road Ahead
• Reprocess Data - Rewind or Backfill
• Schema support
• A few out-of-the-box watermark heuristics
• Infrastructure upgrades with low SPaaS user impact
• Auto scaling
• Easy resource provisioning estimates
• More reusable blocks, like data hygiene
• Rate Limiter for External Service Calls