This document discusses Flink's connector ecosystem and how to get data in and out of Flink. It describes Flink's layered APIs, including Flink SQL, Table API, DataStream API, and ProcessFunction. It also covers Flink's source and sink interfaces, including the unified source and sink interfaces, hybrid sources, and watermark alignment. The document provides guidance on when to write custom connectors versus leveraging Flink's async I/O capabilities. It notes that Source- and SinkFunction are being deprecated and encourages contributing to existing connectors or building new ones.
Getting Data In and Out of Flink: Understanding Its Connector Ecosystem
1. Getting Data In and Out of Flink
Understanding Flink and Its Connector Ecosystem
2. Real-time services rely on stream processing
[Diagram: real-time data — a sale, a shipment, a trade, a customer experience — flows through real-time stream processing to power rich front-end customer experiences and real-time backend operations]
3. Flink has layered APIs at different levels of abstraction
[Diagram: Flink's APIs stacked by level of abstraction and by how the code is organized — Flink SQL and the Table API (with optimizer/planner) at the top, the DataStream API and ProcessFunction below, all running on the Apache Flink Runtime's low-level stream operator API]
Flink SQL
High-level, declarative API that allows you to write SQL
queries to process data streams and batch data as
dynamic tables
Table API
Programmatic equivalent of Flink SQL, allowing you to
define your business logic in either Java or Python, or
combine it with SQL
DataStream API
Low-level, expressive API that exposes the building
blocks for stream processing, giving you direct access to
things like state and timers
ProcessFunction
The lowest-level API, allowing fine-grained processing of
individual elements for complex event-driven processing
logic and state management
4. Data Processing is a Stream of Changes
● A stream is a sequence of events
● Business data is always a stream: bounded or unbounded
● Batch processing is just a special case in the runtime → An optimised processing
mode is used
[Diagram: a timeline from past to future — a bounded stream covers a fixed interval, while an unbounded stream starts in the past and continues past "now" into the future]
5. One API for both batch and streaming processing
● Reduce cognitive load: developers don’t have to learn multiple APIs
● Consistent semantics across batch and streaming
● Unified connector APIs for working with external systems
○ Each connector should be able to work as a bounded (batch) and as an
unbounded (continuous streaming) source and/or sink.
13. Unified Sink Interface
● Since Flink 1.12, currently @PublicEvolving and @Experimental
● Sink
○ The factory, and the only Serializable interface
● SinkWriter
○ Writing
● Optional PreCommit Topology
○ E.g. how Hive compacts small files
● Optional Committer
○ For 2-phase commit protocols/transactions
○ Enables exactly-once sink
● Optional PostCommit Topology
○ E.g. Iceberg compacts the already-written files and updates the file log after the
compaction
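The writer/committer split above is what enables exactly-once: data is staged during writing and only made visible after the checkpoint completes. A minimal sketch of that two-phase flow, using illustrative names rather than the real Sink V2 interfaces:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the two-phase commit flow behind the unified sink: a writer
// stages records, snapshots them into a "committable" at checkpoint time,
// and a committer makes them visible only after the checkpoint succeeds.
// Class and method names are illustrative, not the real Sink V2 API.
public class TwoPhaseSketch {
    static class Committable {
        final List<String> records;
        Committable(List<String> records) { this.records = records; }
    }

    private final List<String> staged = new ArrayList<>();
    private final List<String> committed = new ArrayList<>();

    // The writer's role: stage data without making it visible.
    public void write(String record) { staged.add(record); }

    // Phase 1: on checkpoint, hand off the staged records as a committable.
    public Committable prepareCommit() {
        Committable c = new Committable(new ArrayList<>(staged));
        staged.clear();
        return c;
    }

    // Phase 2: only after the checkpoint succeeds, make the data visible.
    // If this fails, the committable can be retried (see the next slide).
    public void commit(Committable c) { committed.addAll(c.records); }

    public List<String> visible() { return committed; }
}
```

The key property: records written between checkpoints are never visible to downstream readers until `commit` runs, so a replay after failure cannot produce duplicates.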
14. Unified Sink Interface - Batch
● Implicit support for streaming and batch
● Streaming: Committed on checkpoints
○ Failed committables are retried after some time
○ Committables of failed checkpoints carry over to new checkpoint
■ Only when failure is async (e.g. committables snapshot failed, RPC failure from TM to JM)
■ Synchronization point failures (e.g. flushing) will fail the checkpoint
○ On recovery, recommit all committables of last checkpoint
● Batch: Committed after all data has been processed
○ Indefinite retries on committables
15. Async Sink
● Recommended starting point for asynchronous sinks - See the blog!
● Convert input to request
● Async Sink bundles into batches
● Send batches asynchronously
● Async Sink resubmits failures
● Convenient batching
● Size-based (when X requests/Y bytes reached)
● Time-based (at least every Z seconds)
● Supports at-least-once
● Abstracts threading model, lock-free implementation
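The size- and time-based batching rules above can be sketched as a small buffer. The thresholds and names here are illustrative; the real Async Sink base classes also handle the threading model and retries for you:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the Async Sink batching rules: flush when a request-count or
// byte-size threshold is reached (size-based), or when a time interval
// has elapsed since the first buffered request (time-based).
public class BatchBuffer {
    private final int maxRequests;
    private final long maxBytes;
    private final long maxAgeMillis;
    private final List<String> buffer = new ArrayList<>();
    private long bytes = 0;
    private long firstAddTime = -1;

    public BatchBuffer(int maxRequests, long maxBytes, long maxAgeMillis) {
        this.maxRequests = maxRequests;
        this.maxBytes = maxBytes;
        this.maxAgeMillis = maxAgeMillis;
    }

    public void add(String request, long nowMillis) {
        if (buffer.isEmpty()) firstAddTime = nowMillis;
        buffer.add(request);
        bytes += request.length();
    }

    // Size-based (X requests / Y bytes) or time-based (at least every Z ms).
    public boolean shouldFlush(long nowMillis) {
        return buffer.size() >= maxRequests
                || bytes >= maxBytes
                || (!buffer.isEmpty() && nowMillis - firstAddTime >= maxAgeMillis);
    }

    // Drain the buffer into one batch to be sent asynchronously;
    // failed requests would be re-added and resubmitted.
    public List<String> flush() {
        List<String> batch = new ArrayList<>(buffer);
        buffer.clear();
        bytes = 0;
        firstAddTime = -1;
        return batch;
    }
}
```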
16. What if you don’t want to write a sink?
You could decide to leverage the Async I/O capabilities in Flink
● Async I/O is meant for enriching (lookup) stream events with data stored in a database
○ But: could use it as a sink
○ Only at-least-once guarantees
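A self-contained analogue of what Async I/O does: fire a non-blocking lookup per event and collect the result later, instead of blocking the processing thread on each call. In real Flink code this is the `AsyncFunction` interface; here the "database" is a stand-in and the names are illustrative:

```java
import java.util.concurrent.CompletableFuture;

// Sketch of the async-enrichment pattern behind Flink's Async I/O:
// each event triggers a non-blocking lookup, and the result is
// completed later rather than blocking the processing thread.
public class AsyncLookup {
    // Pretend database client: enrich an order id with a customer name.
    static CompletableFuture<String> lookup(String orderId) {
        return CompletableFuture.supplyAsync(() -> orderId + ":customer-42");
    }

    public static String enrich(String orderId) {
        // In Flink, the framework collects the future's result and
        // preserves ordering guarantees; here we simply join.
        return lookup(orderId).join();
    }
}
```

Used as a "sink", the enriched result would be discarded after the external write completes, which is why only at-least-once guarantees are possible this way.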
18. HybridSource
● Use cases: enable initial loads or backfills by reading from one or more bounded sources before
switching to an unbounded source automatically
● Enables switching sources with either predetermined start positions or with position
conversion at switch time.
● Can be used with all sources that use the Unified Source interfaces
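The switch-with-position-conversion idea can be sketched in a few lines: drain the bounded source, then derive the unbounded source's start position from where the bounded read ended. This is not the real `HybridSource` builder API; the names are illustrative:

```java
import java.util.List;

// Sketch of the HybridSource idea: replay a bounded source (e.g. a
// historical file dump) to completion, then switch to an unbounded
// source, converting the end position of the bounded read into the
// start position of the unbounded one at switch time.
public class HybridSketch {
    public static long replayThenSwitch(List<Long> boundedOffsets) {
        long lastOffset = -1;
        for (long offset : boundedOffsets) {
            lastOffset = offset;  // "process" each historical record
        }
        // Position conversion: the unbounded source (e.g. Kafka)
        // starts right after the last offset the bounded source produced.
        return lastOffset + 1;
    }
}
```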
19. Watermark alignment and idleness detection
● Watermarks measure progress of event time
● A watermark is an assertion about the completeness of the stream
● Each watermark is the max timestamp seen so far, minus an out-of-orderness estimate
● Watermark alignment prevents splits (partitions/shards) from advancing their event-time
watermarks too far ahead of the rest, avoiding skew
● Idleness detection marks sources/splits/shards/partitions as idle so that Flink's watermarks
can still progress
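The two rules above can be expressed directly: a split's watermark is its max timestamp minus the out-of-orderness bound, and alignment caps how far any split may run ahead of the slowest one. The drift limit is an illustrative parameter:

```java
// Sketch of the watermark rules from the bullets above.
public class WatermarkSketch {
    // A watermark asserts: "no more events with timestamp <= this value",
    // computed as the max timestamp seen so far minus an out-of-orderness bound.
    public static long watermark(long maxTimestampSeen, long outOfOrderness) {
        return maxTimestampSeen - outOfOrderness;
    }

    // Alignment: a split may only advance while its watermark stays within
    // an allowed drift of the minimum watermark across all splits.
    public static boolean mayAdvance(long splitWatermark, long minWatermark, long maxDrift) {
        return splitWatermark - minWatermark <= maxDrift;
    }
}
```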
21. Deprecation of Source- and SinkFunction
● Only for streaming
● Monolithic API
● Serializable with non-serializable fields
● Source
○ Push-based makes checkpointing under backpressure inconsistent
○ Split discovery, data deserialization and emission, offset management
○ Thread synchronization, cancellation
○ Chained tasks also run in the source thread
● Sink
○ Data serialization and emission, transaction management
○ Hard to write exactly-once
○ Async callbacks are hard to handle
○ Usually involves synchronization within the SinkFunction
22. Contributing to Flink connectors
● Good to know:
○ Connectors have been externalized from the Flink repository itself
● The Flink community loves contributions for connectors
○ Build your own and link it on https://flink-packages.org/
○ Check the Jira or open a discussion on the Dev mailing list
● Some connectors lack maintainers (e.g. RabbitMQ, PubSub, Elasticsearch) - if you want to
help out, reach out!