Presented By:
Kundan Kumar
Software Consultant
Stateful Stream Processing with Apache Flink
Lack of etiquette and manners is a huge turn-off.
KnolX Etiquettes
Punctuality
Respect KnolX session timings. Please do not join a session more than
5 minutes after its start time.
Feedback
Please submit constructive feedback for every session; it is
very helpful for the presenter.
Mute
Stay on mute unless you have
questions or concerns.
Agenda
01 What is stateful stream processing?
02 Flink's take on stateful stream processing
03 Demo
What is Stateful Stream Processing?
Streaming and Stream Processing:
Stream processing is the processing of data in motion, or in other words, computing on data directly
as it is produced or received.
The systems that receive and send the data streams and execute the application or analytics logic are
called stream processors.
Stateful Stream Processing:
Stateful stream processing is a subset of stream processing in which the computation maintains
contextual state. This state stores information derived from previously seen events.
In other words, state is shared between events (stream entities), so past events can influence
the way current events are processed.
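To make the distinction concrete, here is a minimal sketch in plain Python (not Flink API). The function `stateless_double`, the class `RunningSum`, and the sample events are hypothetical names chosen for illustration only.

```python
from collections import defaultdict

def stateless_double(event):
    """Stateless: the output depends only on the current event."""
    key, value = event
    return key, value * 2

class RunningSum:
    """Stateful: keeps a per-key sum of every value seen so far."""

    def __init__(self):
        self.state = defaultdict(int)  # contextual state derived from past events

    def process(self, event):
        key, value = event
        self.state[key] += value       # past events influence this result
        return key, self.state[key]

events = [("a", 1), ("b", 5), ("a", 2), ("a", 3)]

stateless_out = [stateless_double(e) for e in events]

op = RunningSum()
stateful_out = [op.process(e) for e in events]
```

The stateless function always maps `("a", 2)` to the same output, while the stateful operator's output for `("a", 2)` depends on the `("a", 1)` event that arrived before it.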
Flink's take on stateful stream processing
Flink in a nutshell:
● Apache Flink is a Big Data framework and distributed processing engine for stateful
computations over unbounded and bounded data streams.
➢ A Flink application may consume real-time data from streaming sources such as
message queues or distributed logs, like Apache Kafka or Kinesis.
➢ Flink can also consume bounded, historic data from a variety of data sources.
➢ The streams of results being produced by a Flink application can be sent to a wide
variety of systems that can be connected as sinks.
➢ Key properties: fast in-memory processing, scalability, support for large state, fault tolerance, event-time processing, and exactly-once state consistency.
[Figure: Source → Transformations → Sink]
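The source, transformation, and sink stages can be sketched with plain Python generators. This is only a conceptual illustration, not Flink code: the functions `source`, `transform`, and `sink` and the in-memory list used as a sink are assumptions made for the example.

```python
def source():
    """Hypothetical bounded source; a real job might read from Kafka or Kinesis."""
    yield from ["flink state", "flink stream", "state"]

def transform(lines):
    """Transformation: split each line into words and upper-case them."""
    for line in lines:
        for word in line.split():
            yield word.upper()

def sink(records, out):
    """Hypothetical sink: append every record to an in-memory list."""
    for record in records:
        out.append(record)

collected = []
sink(transform(source()), collected)
```

Records flow through the stages one at a time, which mirrors how a streaming dataflow processes data in motion rather than materializing the whole input first.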
➢ Programs in Flink are inherently parallel and distributed.
➢ During execution, a stream has one or more stream partitions, and each
operator has one or more operator subtasks.
➢ Flink facilitates stateful operations.
➢ The handling of the current event can depend on the accumulated effect of all the events that
came before it.
➢ The set of parallel instances of a stateful operator is effectively a sharded key-value
store. Each parallel instance is responsible for handling events for a specific group of
keys, and the state for those keys is kept locally.
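The sharded key-value store idea can be sketched as follows. This is a simplification under stated assumptions: the routing function `subtask_for` uses a CRC32 hash modulo a hypothetical `PARALLELISM`, which is not Flink's actual key assignment scheme, and `CountSubtask` is an invented class.

```python
import zlib

PARALLELISM = 3  # hypothetical number of parallel subtasks

def subtask_for(key: str) -> int:
    """Route a key to one subtask (illustrative hashing, not Flink's own scheme)."""
    return zlib.crc32(key.encode("utf-8")) % PARALLELISM

class CountSubtask:
    """One parallel instance; it only ever sees, and stores state for, its own keys."""

    def __init__(self):
        self.local_state = {}

    def process(self, key):
        self.local_state[key] = self.local_state.get(key, 0) + 1
        return self.local_state[key]

subtasks = [CountSubtask() for _ in range(PARALLELISM)]
events = ["user-1", "user-2", "user-1", "user-3", "user-1", "user-2"]

for key in events:
    subtasks[subtask_for(key)].process(key)

# Taken together, the local states form one logical key-value store.
merged = {}
for subtask in subtasks:
    merged.update(subtask.local_state)
```

Because every event for a given key is routed to the same subtask, each state access is local and no coordination between subtasks is needed.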
States in Flink
➢ Operator State: State maintained per parallel operator instance rather than per key. Mainly
used in source and sink implementations.
➢ Keyed State: State maintained per key on a keyed stream. Each key has its own state, which
acts as an embedded key-value store.
➢ Broadcast State: A special type of operator state used when records of one stream must be
broadcast to all downstream tasks that need access to those records.
➢ Queryable State: A feature that allows external clients to query a job's state from outside Flink.
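As one illustration, the broadcast state pattern might look like the sketch below in plain Python. It is not the Flink API: `FilterSubtask`, `broadcast`, and the round-robin distribution of events are hypothetical choices for the example.

```python
class FilterSubtask:
    """One parallel instance of an operator that filters out blocked words."""

    def __init__(self):
        self.blocked = set()  # broadcast state: every subtask holds an identical copy

    def on_broadcast(self, word):
        """Called for each record of the broadcast (control) stream."""
        self.blocked.add(word)

    def on_event(self, word):
        """Called for each record of the main stream."""
        return None if word in self.blocked else word

subtasks = [FilterSubtask() for _ in range(2)]

def broadcast(word):
    """Deliver a control record to all downstream subtasks."""
    for subtask in subtasks:
        subtask.on_broadcast(word)

broadcast("spam")

# Main-stream events are spread round-robin across the two subtasks.
words = ["hello", "spam", "spam", "flink"]
out = [subtasks[i % 2].on_event(w) for i, w in enumerate(words)]
```

Whichever subtask receives a blocked word, it is filtered, because the control record reached every instance.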
Stateful streaming application in Flink
State Backends
1. Memory state backend:
➢ This is the default backend used by Flink if nothing is configured.
➢ Keeps the working state in the JVM heap of each task manager.
➢ This backend should not be used in production jobs.
➢ It stores checkpoints (backups of the state) in the job manager's memory, which puts
unnecessary pressure on the job manager's operational stability.
2. File system backend:
➢ This backend is similar to the memory state backend, except that it stores checkpoints on a
filesystem rather than in the job manager's memory.
➢ The filesystem can be the task manager's local filesystem or a durable store such as
HDFS/S3.
3. RocksDB backend:
➢ This backend stores state in RocksDB, an embedded key-value store developed at Facebook.
➢ RocksDB maintains an in-memory table (the memtable) along with Bloom filters, so reading
recent data is extremely fast.
➢ Each task manager maintains its own RocksDB files, and this state is checkpointed to a
durable store such as HDFS/S3.
➢ This is the only backend that supports incremental checkpointing, i.e. backing up only the
data modified since the previous checkpoint rather than the complete state.
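The idea behind incremental checkpointing can be sketched conceptually as below. This is not how the RocksDB backend is implemented internally; `IncrementalStore`, its dirty-key tracking, and `restore` are hypothetical, and deletions are ignored to keep the sketch short.

```python
class IncrementalStore:
    """Toy key-value store that remembers which keys changed since the last checkpoint."""

    def __init__(self):
        self.data = {}
        self.dirty = set()

    def put(self, key, value):
        self.data[key] = value
        self.dirty.add(key)

    def full_checkpoint(self):
        """Back up the complete state."""
        self.dirty.clear()
        return dict(self.data)

    def incremental_checkpoint(self):
        """Back up only the keys modified since the previous checkpoint."""
        delta = {key: self.data[key] for key in self.dirty}
        self.dirty.clear()
        return delta

def restore(base, deltas):
    """Rebuild state from a full checkpoint plus the increments taken after it."""
    state = dict(base)
    for delta in deltas:
        state.update(delta)
    return state

store = IncrementalStore()
store.put("a", 1)
store.put("b", 2)
base = store.full_checkpoint()

store.put("b", 20)
store.put("c", 3)
delta = store.incremental_checkpoint()

restored = restore(base, [delta])
```

The increment contains two entries instead of three; with large state and few changes between checkpoints, that difference is what makes incremental checkpoints cheaper.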
Checkpointing
Checkpoint: A marked point in each input stream from which the stream can be
replayed, together with the state of all stateful operators at that point. Flink
periodically saves this state to a reliable storage system.
Stream Barriers: Lightweight stream markers with unique IDs. Flink injects them into the
input streams, and they flow in line with the records.
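Recovery from a checkpoint can be sketched as restoring the saved state and replaying the input from the saved position. The function `run`, its parameters `checkpoint_every` and `fail_at`, and the sample events are assumptions for illustration; a single in-process loop stands in for a distributed job.

```python
events = [3, 1, 4, 1, 5, 9, 2, 6]

def run(events, checkpoint_every, fail_at=None):
    """Sum the events, checkpointing (offset, total) every few records.

    If fail_at is set, simulate one crash just before that offset is processed,
    then restore the last checkpoint and replay from its offset.
    """
    offset, total = 0, 0
    checkpoint = (0, 0)       # stands in for reliable storage
    failed = False
    while offset < len(events):
        if fail_at is not None and offset == fail_at and not failed:
            failed = True
            offset, total = checkpoint   # restore state and rewind the input
            continue
        total += events[offset]
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (offset, total)
    return total

without_failure = run(events, checkpoint_every=3)
with_failure = run(events, checkpoint_every=3, fail_at=5)
```

Both runs produce the same total: the records processed after the last checkpoint are replayed, but their earlier effect on the state was discarded by the restore, so each record is reflected in the state exactly once.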
Checkpointing mechanism
Aligned Checkpointing
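A rough sketch of barrier alignment for an operator with two input channels follows. It is a simplified single-checkpoint simulation, not Flink's implementation: `aligned_sum`, the `BARRIER` marker, and the arrival list are hypothetical. Once a barrier arrives on one channel, later records from that channel are held back until the barrier has arrived on every channel; only then is the state snapshotted.

```python
from collections import deque

BARRIER = "BARRIER"  # hypothetical marker standing in for a checkpoint barrier

def aligned_sum(arrivals, num_channels=2):
    """arrivals: (channel, item) pairs in arrival order. Returns (snapshot, final_total)."""
    total = 0
    blocked = set()      # channels whose barrier has already arrived
    buffered = deque()   # records held back from blocked channels
    snapshot = None
    for channel, item in arrivals:
        if item == BARRIER:
            blocked.add(channel)
            if len(blocked) == num_channels:
                snapshot = total      # state reflects only pre-barrier records
                blocked.clear()
                while buffered:       # release the held-back records
                    total += buffered.popleft()
        elif channel in blocked:
            buffered.append(item)     # post-barrier record: wait for alignment
        else:
            total += item
    return snapshot, total

arrivals = [
    (0, 1), (1, 10),
    (0, BARRIER),          # channel 0 is now blocked
    (0, 2),                # held back until channel 1's barrier arrives
    (1, 20),
    (1, BARRIER),          # aligned: take the snapshot
    (1, 30), (0, 3),
]
snapshot, final_total = aligned_sum(arrivals)
```

The snapshot contains exactly the records that preceded the barrier on both channels (1, 10, and 20), even though record 2 physically arrived before the second barrier.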
Unaligned Checkpointing
Demo
Q/A
References
1. https://flink.apache.org
2. https://ci.apache.org/projects/flink/flink-docs-release-1.12/concepts/stateful-stream-processing.html#unaligned-checkpointing
3. Book: Learning Apache Flink, by Tanmay Deshpande
Thank You!
