This document discusses Flink's connector ecosystem and how to get data in and out of Flink. It describes Flink's layered APIs, including Flink SQL, Table API, DataStream API, and ProcessFunction. It also covers Flink's source and sink interfaces, including the unified source and sink interfaces, hybrid sources, and watermark alignment. The document provides guidance on when to write custom connectors versus leveraging Flink's async I/O capabilities. It notes that Source- and SinkFunction are being deprecated and encourages contributing to existing connectors or building new ones.
Getting Data In and Out of Flink: Understanding Its Connector Ecosystem
1. Getting Data In and Out of Flink
Understanding Flink and Its Connector Ecosystem
2. Real-time services rely on stream processing
[Diagram: real-time data — a sale, a shipment, a trade, a customer experience — flows through real-time stream processing to power rich front-end customer experiences and real-time backend operations]
3. Flink has layered APIs at different levels of abstraction
[Diagram: Flink's APIs stacked by level of abstraction and by how the code is organized — Flink SQL and the Table API (with optimizer/planner) at the top, the DataStream API and ProcessFunction below, all running on the Apache Flink Runtime's low-level stream operator API]
Flink SQL
High-level, declarative API that allows you to write SQL
queries to process data streams and batch data as
dynamic tables
Table API
Programmatic equivalent of Flink SQL, allowing you to
define your business logic in either Java or Python, or
combine it with SQL
DataStream API
Low-level, expressive API that exposes the building
blocks for stream processing, giving you direct access to
things like state and timers
ProcessFunction
The lowest-level API, allowing fine-grained processing of
individual elements for complex event-driven processing
logic and state management
4. Data Processing is a Stream of Changes
● A stream is a sequence of events
● Business data is always a stream: bounded or unbounded
● Batch processing is just a special case in the runtime → An optimised processing
mode is used
[Diagram: a timeline from past to future — a bounded stream covers a fixed interval, while an unbounded stream starts in the past and continues past "now" into the future]
5. One API for both batch and streaming processing
● Reduce cognitive load: developers don’t have to learn multiple APIs
● Consistent semantics across batch and streaming
● Unified connector APIs for working with external systems
○ Each connector should be able to work as a bounded (batch) and as an
unbounded (continuous streaming) source and/or sink.
13. Unified Sink Interface
● Since Flink 1.12, currently @PublicEvolving and @Experimental
● Sink
○ The factory, and the only Serializable interface
● SinkWriter
○ Writing
● Optional PreCommit Topology
○ E.g. how Hive compacts small files
● Optional Committer
○ For 2-phase commit protocols/transactions
○ Enables exactly-once sink
● Optional PostCommit Topology
○ E.g. Iceberg compacts the already-written files and updates the file log after the
compaction
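The writer/committer split above is what enables exactly-once: data is staged during writing and only made visible after the checkpoint completes. A minimal sketch of that two-phase flow, using illustrative names rather than the real Sink V2 interfaces:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the two-phase commit flow behind the unified sink: a writer
// stages records, snapshots them into a "committable" at checkpoint time,
// and a committer makes them visible only after the checkpoint succeeds.
// Class and method names are illustrative, not the real Sink V2 API.
public class TwoPhaseSketch {
    static class Committable {
        final List<String> records;
        Committable(List<String> records) { this.records = records; }
    }

    private final List<String> staged = new ArrayList<>();
    private final List<String> committed = new ArrayList<>();

    // The writer's role: stage data without making it visible.
    public void write(String record) { staged.add(record); }

    // Phase 1: on checkpoint, hand off the staged records as a committable.
    public Committable prepareCommit() {
        Committable c = new Committable(new ArrayList<>(staged));
        staged.clear();
        return c;
    }

    // Phase 2: only after the checkpoint succeeds, make the data visible.
    // If this fails, the committable can be retried (see the next slide).
    public void commit(Committable c) { committed.addAll(c.records); }

    public List<String> visible() { return committed; }
}
```

The key property: records written between checkpoints are never visible to downstream readers until `commit` runs, so a replay after failure cannot produce duplicates.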
14. Unified Sink Interface - Batch
● Implicit support for streaming and batch
● Streaming: Committed on checkpoints
○ Failed committables are retried after some time
○ Committables of failed checkpoints carry over to new checkpoint
■ Only when failure is async (e.g. committables snapshot failed, RPC failure from TM to JM)
■ Synchronization point failures (e.g. flushing) will fail the checkpoint
○ On recovery, recommit all committables of last checkpoint
● Batch: Committed after all data has been processed
○ Indefinite retries on committables
15. Async Sink
● Recommended starting point for asynchronous sinks - See the blog!
● Convert input to request
● Async Sink bundles into batches
● Send batches asynchronously
● Async Sink resubmits failures
● Convenient batching
● Size-based (when X requests/Y bytes reached)
● Time-based (at least every Z seconds)
● Supports at-least-once
● Abstracts threading model, lock-free implementation
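The size- and time-based batching rules above can be sketched as a small buffer. The thresholds and names here are illustrative; the real Async Sink base classes also handle the threading model and retries for you:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the Async Sink batching rules: flush when a request-count or
// byte-size threshold is reached (size-based), or when a time interval
// has elapsed since the first buffered request (time-based).
public class BatchBuffer {
    private final int maxRequests;
    private final long maxBytes;
    private final long maxAgeMillis;
    private final List<String> buffer = new ArrayList<>();
    private long bytes = 0;
    private long firstAddTime = -1;

    public BatchBuffer(int maxRequests, long maxBytes, long maxAgeMillis) {
        this.maxRequests = maxRequests;
        this.maxBytes = maxBytes;
        this.maxAgeMillis = maxAgeMillis;
    }

    public void add(String request, long nowMillis) {
        if (buffer.isEmpty()) firstAddTime = nowMillis;
        buffer.add(request);
        bytes += request.length();
    }

    // Size-based (X requests / Y bytes) or time-based (at least every Z ms).
    public boolean shouldFlush(long nowMillis) {
        return buffer.size() >= maxRequests
                || bytes >= maxBytes
                || (!buffer.isEmpty() && nowMillis - firstAddTime >= maxAgeMillis);
    }

    // Drain the buffer into one batch to be sent asynchronously;
    // failed requests would be re-added and resubmitted.
    public List<String> flush() {
        List<String> batch = new ArrayList<>(buffer);
        buffer.clear();
        bytes = 0;
        firstAddTime = -1;
        return batch;
    }
}
```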
16. What if you don’t want to write a sink?
You could decide to leverage the Async I/O capabilities in Flink
● Async I/O is meant for enriching (lookup) stream events with data stored in a database
○ But: could use it as a sink
○ Only at-least-once guarantees
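A self-contained analogue of what Async I/O does: fire a non-blocking lookup per event and collect the result later, instead of blocking the processing thread on each call. In real Flink code this is the `AsyncFunction` interface; here the "database" is a stand-in and the names are illustrative:

```java
import java.util.concurrent.CompletableFuture;

// Sketch of the async-enrichment pattern behind Flink's Async I/O:
// each event triggers a non-blocking lookup, and the result is
// completed later rather than blocking the processing thread.
public class AsyncLookup {
    // Pretend database client: enrich an order id with a customer name.
    static CompletableFuture<String> lookup(String orderId) {
        return CompletableFuture.supplyAsync(() -> orderId + ":customer-42");
    }

    public static String enrich(String orderId) {
        // In Flink, the framework collects the future's result and
        // preserves ordering guarantees; here we simply join.
        return lookup(orderId).join();
    }
}
```

Used as a "sink", the enriched result would be discarded after the external write completes, which is why only at-least-once guarantees are possible this way.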
18. HybridSource
● Use cases: enable initial loads or backfills by reading from one or more bounded sources before
switching to an unbounded source automatically
● Enables switching sources with either predetermined start positions or with position
conversion at switch time.
● Can be used with all sources that use the Unified Source interfaces
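The switch-with-position-conversion idea can be sketched in a few lines: drain the bounded source, then derive the unbounded source's start position from where the bounded read ended. This is not the real `HybridSource` builder API; the names are illustrative:

```java
import java.util.List;

// Sketch of the HybridSource idea: replay a bounded source (e.g. a
// historical file dump) to completion, then switch to an unbounded
// source, converting the end position of the bounded read into the
// start position of the unbounded one at switch time.
public class HybridSketch {
    public static long replayThenSwitch(List<Long> boundedOffsets) {
        long lastOffset = -1;
        for (long offset : boundedOffsets) {
            lastOffset = offset;  // "process" each historical record
        }
        // Position conversion: the unbounded source (e.g. Kafka)
        // starts right after the last offset the bounded source produced.
        return lastOffset + 1;
    }
}
```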
19. Watermark alignment and idleness detection
● Watermarks measure progress of event time
● A watermark is an assertion about the completeness of the stream
● Each watermark is the max timestamp seen so far, minus an out-of-orderness estimate
● Watermark alignment prevents splits (partitions/shards) from advancing their event-time
watermarks too far ahead of the rest, avoiding skew
● Idleness detection marks sources/splits/shards/partitions as idle so that Flink's watermarks
can still progress
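The two rules above can be expressed directly: a split's watermark is its max timestamp minus the out-of-orderness bound, and alignment caps how far any split may run ahead of the slowest one. The drift limit is an illustrative parameter:

```java
// Sketch of the watermark rules from the bullets above.
public class WatermarkSketch {
    // A watermark asserts: "no more events with timestamp <= this value",
    // computed as the max timestamp seen so far minus an out-of-orderness bound.
    public static long watermark(long maxTimestampSeen, long outOfOrderness) {
        return maxTimestampSeen - outOfOrderness;
    }

    // Alignment: a split may only advance while its watermark stays within
    // an allowed drift of the minimum watermark across all splits.
    public static boolean mayAdvance(long splitWatermark, long minWatermark, long maxDrift) {
        return splitWatermark - minWatermark <= maxDrift;
    }
}
```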
21. Deprecation of Source- and SinkFunction
● Only for streaming
● Monolithic API
● Serializable with non-serializable fields
● Source
○ Push-based makes checkpointing under backpressure inconsistent
○ Split discovery, data deserialization and emission, offset management
○ Thread synchronization, cancellation
○ Chained tasks also run in the source thread
● Sink
○ Data serialization and emission, transaction management
○ Hard to write exactly-once
○ Async callbacks are hard to handle
○ Usually involves synchronization within the SinkFunction
22. Contributing to Flink connectors
● Good to know:
○ Connectors have been externalized from the Flink repository itself
● The Flink community loves contributions for connectors
○ Build your own and link it on https://flink-packages.org/
○ Check the Jira or open a discussion on the Dev mailing list
● Some connectors lack maintainers (e.g. RabbitMQ, PubSub, Elasticsearch) - if you want to
help out, reach out!