Click-Through Example for Flink’s KafkaConsumer Checkpointing

See how Apache Flink's Kafka consumer integrates with Flink's checkpointing mechanism to provide exactly-once guarantees.


[Diagram: two Kafka partitions, each holding a b c d e, feed two Flink Kafka Consumers and one Flink Map Operator. Consumer offsets = 0, 0; counter = 0; Zookeeper offsets: partition 0 = 0, partition 1 = 0; Flink Checkpoint Coordinator: pending and completed both empty.]
This toy example reads from a Kafka topic with two partitions, each containing “a”, “b”, “c”, … as messages.
The offset is set to 0 for both partitions, and a counter is initialized to 0.

[Diagram: consumer offsets = 1, 0; message “a” in flight; counter = 0; Zookeeper offsets: partition 0 = 0, partition 1 = 0; Flink Checkpoint Coordinator: pending and completed both empty.]
The Kafka consumer starts reading messages from partition 0. Message “a” is in flight, and the offset for the first
consumer has been set to 1.

[Diagram: consumer offsets = 2, 1; messages “b” and “a” in flight; counter = 1; Zookeeper offsets: partition 0 = 0, partition 1 = 0; Flink Checkpoint Coordinator: pending and completed both empty, sending “Trigger Checkpoint at source”.]
Message “a” arrives at the counter, which is set to 1. Both consumers read their next records (“b” and “a”), and the
offsets are updated accordingly. In parallel, the checkpoint coordinator decides to trigger a checkpoint at the source …

[Diagram: consumer offsets = 3, 1; counter = 2; Zookeeper offsets: partition 0 = 0, partition 1 = 0; Flink Checkpoint Coordinator holds the source snapshot offsets = 2, 1.]
The sources have created a snapshot of their state (offsets = 2, 1), which is now stored in the checkpoint coordinator.
Each source emitted a checkpoint barrier directly after the last record covered by its snapshot: after “b” on partition 0 and after “a” on partition 1.
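
The source-side step can be sketched as a small simulation. This is only an illustration of the idea on this slide, not Flink's implementation: `ToySource`, `Barrier`, `poll` and `trigger_checkpoint` are hypothetical names, and each partition is assumed to hold exactly the five messages shown in the diagram.

```python
from dataclasses import dataclass


@dataclass
class Barrier:
    """Hypothetical marker that separates pre- and post-checkpoint records."""
    checkpoint_id: int


@dataclass
class ToySource:
    """Hypothetical stand-in for one Kafka consumer reading one partition."""
    records: list
    offset: int = 0

    def poll(self):
        # Emit the next record and advance the read offset.
        record = self.records[self.offset]
        self.offset += 1
        return record

    def trigger_checkpoint(self, checkpoint_id):
        # Snapshot the current offset and produce a barrier that must be
        # emitted right behind the records read so far.
        return self.offset, Barrier(checkpoint_id)


p0 = ToySource(list("abcde"))
p1 = ToySource(list("abcde"))

stream0 = [p0.poll(), p0.poll()]  # partition 0 has emitted "a", "b"
stream1 = [p1.poll()]             # partition 1 has emitted "a"

snap0, barrier0 = p0.trigger_checkpoint(1)
snap1, barrier1 = p1.trigger_checkpoint(1)
stream0.append(barrier0)
stream1.append(barrier1)

stream0.append(p0.poll())         # "c" is read after the barrier

snapshot_offsets = (snap0, snap1)
print(snapshot_offsets, p0.offset, p1.offset)
```

The snapshot records offsets 2 and 1, while the live offsets have already moved on to 3 and 1, which matches the state shown on this slide.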

[Diagram: consumer offsets = 3, 2; counter = 3; Zookeeper offsets: partition 0 = 0, partition 1 = 0; Flink Checkpoint Coordinator holds offsets = 2, 1 and counter = 3.]
The map operator has received checkpoint barriers from both sources. It checkpoints its state (counter = 3) to the
coordinator. Meanwhile, the consumers continue reading data from the Kafka partitions.
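
A minimal sketch of how an operator with two inputs could wait for both barriers before snapshotting its counter. The class `CountingOperator` and the `BARRIER` marker are hypothetical, and the interleaving of records is one assumed order that is consistent with the slides; records that arrive behind a barrier are held back until the barrier from the other input has arrived.

```python
from collections import deque

BARRIER = "BARRIER"  # hypothetical marker object for a checkpoint barrier


class CountingOperator:
    """Toy map operator that counts records and aligns barriers."""

    def __init__(self, num_inputs):
        self.counter = 0
        self.num_inputs = num_inputs
        self.blocked = set()     # inputs whose barrier has already arrived
        self.buffered = deque()  # elements held back on blocked inputs
        self.snapshots = []      # counter values captured at each checkpoint

    def on_element(self, channel, element):
        if channel in self.blocked:
            # Post-barrier data must not affect the snapshot, so hold it.
            self.buffered.append((channel, element))
            return
        if element == BARRIER:
            self.blocked.add(channel)
            if len(self.blocked) == self.num_inputs:
                # Barriers from all inputs are in: snapshot, then unblock.
                self.snapshots.append(self.counter)
                self.blocked.clear()
                pending, self.buffered = self.buffered, deque()
                for ch, el in pending:
                    self.on_element(ch, el)
        else:
            self.counter += 1


op = CountingOperator(num_inputs=2)
op.on_element(0, "a")
op.on_element(1, "a")
op.on_element(1, BARRIER)  # input 1 is now blocked
op.on_element(1, "b")      # held back until alignment completes
op.on_element(0, "b")
op.on_element(0, BARRIER)  # alignment complete: snapshot, then release "b"
op.on_element(0, "c")

print(op.snapshots, op.counter)
```

The snapshot contains counter = 3, covering exactly the records in front of the barriers ("a" and "b" from partition 0, "a" from partition 1), even though the operator goes on to count later records.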

[Diagram: consumer offsets = 3, 2; counter = 4; Zookeeper offsets: partition 0 = 0, partition 1 = 0; Flink Checkpoint Coordinator holds offsets = 2, 1 and counter = 3, and sends “Notify checkpoint complete” to the consumers.]
The checkpoint coordinator informs the Kafka consumers that the checkpoint has been completed. The consumers then commit the
checkpoint's offsets to ZooKeeper. Note that Flink does not rely on the Kafka offsets in ZooKeeper for restoring from failures.
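
The commit-on-notification pattern described here can be sketched as follows. `ToyOffsetCommitter` and its methods are hypothetical names, both consumers are folded into one object for brevity, and a plain dictionary stands in for ZooKeeper. The point of the sketch is that the offsets written externally are the ones captured in the checkpoint, not the consumer's current read position.

```python
class ToyOffsetCommitter:
    """Hypothetical consumer-side bookkeeping for checkpointed offsets."""

    def __init__(self):
        self.current_offsets = {0: 0, 1: 0}
        self.pending = {}                    # checkpoint id -> offsets snapshot
        self.external_store = {0: 0, 1: 0}   # stands in for ZooKeeper

    def snapshot_state(self, checkpoint_id):
        # Remember which offsets belong to this checkpoint.
        snapshot = dict(self.current_offsets)
        self.pending[checkpoint_id] = snapshot
        return snapshot

    def notify_checkpoint_complete(self, checkpoint_id):
        # Only now is it safe to publish the offsets externally.
        offsets = self.pending.pop(checkpoint_id)
        self.external_store.update(offsets)


consumer = ToyOffsetCommitter()
consumer.current_offsets = {0: 2, 1: 1}
consumer.snapshot_state(checkpoint_id=1)

consumer.current_offsets = {0: 3, 1: 2}  # reading continues meanwhile
consumer.notify_checkpoint_complete(checkpoint_id=1)

print(consumer.external_store)
```

After the notification the external store holds offsets 2 and 1, while the consumers are already at 3 and 2. Recovery in this sketch would still use the checkpointed snapshot, in line with the note that the ZooKeeper offsets are not used for restoring.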

[Diagram: consumer offsets = 3, 2; counter = 4; Zookeeper offsets: partition 0 = 2, partition 1 = 1 (“Checkpoint in Zookeeper”); Flink Checkpoint Coordinator holds offsets = 2, 1 and counter = 3.]
The checkpointed offsets are now persisted in ZooKeeper. External tools such as the Kafka Offset Checker can see the lag of the
consumer group.
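
As a worked example of what such a tool would report, lag per partition is the log end offset minus the committed offset. The sketch below assumes each partition holds exactly the five messages visible in the diagram (a to e); `consumer_lag` is a hypothetical helper, not part of any Kafka tool.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Return per-partition lag: records written but not yet committed."""
    return {
        partition: log_end_offsets[partition] - committed_offsets[partition]
        for partition in log_end_offsets
    }


log_end_offsets = {0: 5, 1: 5}    # assumed: five messages per partition
committed_offsets = {0: 2, 1: 1}  # offsets persisted after the checkpoint

lag = consumer_lag(log_end_offsets, committed_offsets)
print(lag)
```

Because offsets are committed only when a checkpoint completes, the reported lag trails the consumers' actual read position (3 and 2 at this point in the walk-through).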

[Diagram: consumer offsets = 4, 2; counter = 5; Zookeeper offsets: partition 0 = 2, partition 1 = 1; Flink Checkpoint Coordinator holds offsets = 2, 1 and counter = 3.]
Processing advances further.

[Diagram: consumer offsets = 2, 1; counter = 3; Zookeeper offsets: partition 0 = 2, partition 1 = 1; Flink Checkpoint Coordinator holds offsets = 2, 1 and counter = 3, and issues “Reset all operators to last completed checkpoint”.]
The checkpoint coordinator restores the state of all operators participating in the checkpoint. The Kafka
sources resume from offsets 2 and 1, and the counter's value is 3.
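
The effect of resetting offsets and counter together can be checked with a short simulation. The helper names `restore` and `run_to_end` are hypothetical, and each partition is assumed to hold exactly five messages. The sketch also shows what would go wrong if only the offsets were rewound while the counter kept its pre-failure value.

```python
partitions = {0: list("abcde"), 1: list("abcde")}  # assumed topic contents

# Last completed checkpoint from the walk-through.
checkpoint = {"offsets": {0: 2, 1: 1}, "counter": 3}


def restore(checkpoint):
    """Return fresh copies of the checkpointed source and operator state."""
    return dict(checkpoint["offsets"]), checkpoint["counter"]


def run_to_end(partitions, offsets, counter):
    """Read every remaining record and count it once."""
    for partition, records in partitions.items():
        while offsets[partition] < len(records):
            offsets[partition] += 1
            counter += 1
    return offsets, counter


# State at the time of failure (offsets 4, 2 and counter 5) is discarded.
offsets, counter = restore(checkpoint)
offsets, counter = run_to_end(partitions, offsets, counter)

# Counter-example: rewind the offsets but keep the pre-failure counter.
_, inconsistent_counter = run_to_end(partitions, {0: 2, 1: 1}, 5)

print(counter, inconsistent_counter)
```

With a consistent restore the counter ends at 10, one count per message. Rewinding only the offsets yields 12, because the records read between the checkpoint and the failure are counted twice.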

Flinkn Forward San Francisco 2022. In this talk, we will cover various topics around performance issues that can arise when running a Flink job and how to troubleshoot them. We’ll start with the basics, like understanding what the job is doing and what backpressure is. Next, we will see how to identify bottlenecks and which tools or metrics can be helpful in the process. Finally, we will also discuss potential performance issues during the checkpointing or recovery process, as well as and some tips and Flink features that can speed up checkpointing and recovery times. by Piotr Nowojski

Tuning Apache Kafka Connectors for Flink.pptx

Flink Forward

Flink Forward San Francisco 2022. In normal situations, the default Kafka consumer and producer configuration options work well. But we all know life is not all roses and rainbows and in this session we’ll explore a few knobs that can save the day in atypical scenarios. First, we'll take a detailed look at the parameters available when reading from Kafka. We’ll inspect the params helping us to spot quickly an application lock or crash, the ones that can significantly improve the performance and the ones to touch with gloves since they could cause more harm than benefit. Moreover we’ll explore the partitioning options and discuss when diverging from the default strategy is needed. Next, we’ll discuss the Kafka Sink. After browsing the available options we'll then dive deep into understanding how to approach use cases like sinking enormous records, managing spikes, and handling small but frequent updates.. If you want to understand how to make your application survive when the sky is dark, this session is for you! by Olena Babenko

A Deep Dive into Kafka Controller

confluent

Presentation at Strata Data Conference 2018, New York The controller is the brain of Apache Kafka. A big part of what the controller does is to maintain the consistency of the replicas and determine which replica can be used to serve the clients, especially during individual broker failure. Jun Rao outlines the main data flow in the controller—in particular, when a broker fails, how the controller automatically promotes another replica as the leader to serve the clients, and when a broker is started, how the controller resumes the replication pipeline in the restarted broker. Jun then describes recent improvements to the controller that allow it to handle certain edge cases correctly and increase its performance, which allows for more partitions in a Kafka cluster.

Autoscaling Flink with Reactive Mode

Flink Forward

Flink Forward San Francisco 2022. Resource Elasticity is a frequently requested feature in Apache Flink: Users want to be able to easily adjust their clusters to changing workloads for resource efficiency and cost saving reasons. In Flink 1.13, the initial implementation of Reactive Mode was introduced, later releases added more improvements to make the feature production ready. In this talk, we’ll explain scenarios to deploy Reactive Mode to various environments to achieve autoscaling and resource elasticity. We’ll discuss the constraints to consider when planning to use this feature, and also potential improvements from the Flink roadmap. For those interested in the internals of Flink, we’ll also briefly explain how the feature is implemented, and if time permits, conclude with a short demo. by Robert Metzger

Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...

Flink Forward

Flink Forward San Francisco 2022. Probably everyone who has written stateful Apache Flink applications has used one of the fault-tolerant keyed state primitives ValueState, ListState, and MapState. With RocksDB, however, retrieving and updating items comes at an increased cost that you should be aware of. Sometimes, these may not be avoidable with the current API, e.g., for efficient event-time stream-sorting or streaming joins where you need to iterate one or two buffered streams in the right order. With FLIP-220, we are introducing a new state primitive: BinarySortedMultiMapState. This new form of state offers you to (a) efficiently store lists of values for a user-provided key, and (b) iterate keyed state in a well-defined sort order. Both features can be backed efficiently by RocksDB with a 2x performance improvement over the current workarounds. This talk will go into the details of the new API and its implementation, present how to use it in your application, and talk about the process of getting it into Flink. by Nico Kruber

Changelog Stream Processing with Apache Flink

Flink Forward

Flink Forward San Francisco 2022. The world is constantly changing. Data is continuously produced and thus should be consumed in a similar fashion by enterprise systems. Only this enables real-time decisions at scale. Message logs such as Apache Kafka can be found in almost every architecture, while databases and other batch systems still provide the foundation. Change Data Capture (CDC) propagates changes downstream. In this talk, we will highlight what it means to be a general data processor and how Flink can act as an integration hub. We present the current state of Flink and how it can power various use cases on both finite and infinite streams. We demonstrate Flink's SQL engine as a changelog processor that is shipped with an ecosystem tailored to process CDC data and maintain materialized views. We will use Kafka as an upsert log, Debezium for connecting to databases, and enrich streams of various sources. Finally, we will combine Flink's Table API with DataStream API for event-driven applications beyond SQL. by Timo Walther

Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake

Databricks

Change Data Capture CDC is a typical use case in Real-Time Data Warehousing. It tracks the data change log -binlog- of a relational database [OLTP], and replay these change log timely to an external storage to do Real-Time OLAP, such as delta/kudu. To implement a robust CDC streaming pipeline, lots of factors should be concerned, such as how to ensure data accuracy , how to process OLTP source schema changed, whether it is easy to build for variety databases with less code.

Evening out the uneven: dealing with skew in Flink

Flink Forward

Flink Forward San Francisco 2022. When running Flink jobs, skew is a common problem that results in wasted resources and limited scalability. In the past years, we have helped our customers and users solve various skew-related issues in their Flink jobs or clusters. In this talk, we will present the different types of skew that users often run into: data skew, key skew, event time skew, state skew, and scheduling skew, and discuss solutions for each of them. We hope this will serve as a guideline to help you reduce skew in your Flink environment. by Jun Qin & Karl Friedrich

Flink Forward San Francisco 2022. The Apache Flink Kubernetes Operator provides a consistent approach to manage Flink applications automatically, without any human interaction, by extending the Kubernetes API. Given the increasing adoption of Kubernetes based Flink deployments the community has been working on a Kubernetes native solution as part of Flink that can benefit from the rich experience of community members and ultimately make Flink easier to adopt. In this talk we give a technical introduction to the Flink Kubernetes Operator and demonstrate the core features and use-cases through in-depth examples." by Thomas Weise

Apache Kafka’s Transactions in the Wild! Developing an exactly-once KafkaSink...

HostedbyConfluent

Apache Kafka is one of the most commonly used connectors with Apache Flink for exactly-once streaming use cases. The combination of both systems allows you to build mission-critical systems that require low end-to-end latency and exactly-once processing eg. banks processing transactions. In Apache Flink 1.14, we released a new KafkaSink based on Apache Flink’s unified Sink interface that natively supports streaming and batch executions. However, we needed to stretch Kafka’s transactions API to fully support exactly-once processing in Flink. In this talk, we will start with a quick recap of Apache Kafka’s transactions and Flink’s checkpointing mechanism. Then, we describe the two-phase commit protocol implemented in KafkaSink in-depth and emphasize the difficulties we have overcome when applying Kafka’s transaction API to longer-lasting transactions. We explain how we ensure performant writing to Apache Kafka and how the KafkaSink recovery works. In summary, this talk should give users a deep dive into how Apache Flink leverages Apache Kafka’s transactions and developers an overview of what they have to consider when using Apache Kafka’s transactions.

A Day in the Life of a ClickHouse Query Webinar Slides

Altinity Ltd

Why do queries run out of memory? How can I make my queries even faster? How should I size ClickHouse nodes for best cost-efficiency? The key to these questions and many others is knowing what happens inside ClickHouse when a query runs. This webinar is a gentle introduction to ClickHouse internals, focusing on topics that will help your applications run faster and more efficiently. We’ll discuss the basic flow of query execution, dig into how ClickHouse handles aggregation and joins, and show you how ClickHouse distributes processing within a single CPU as well as across many nodes in the network. After attending this webinar you’ll understand how to open up the black box and see what the parts are doing.

kafka

Amikam Snir

Building a fully managed stream processing platform on Flink at scale for Lin...

Flink Forward

Apache Flink is a distributed stream processing framework that allows users to process and analyze data in real-time. At LinkedIn, we developed a fully managed stream processing platform on Flink running on K8s to power hundreds of stream processing pipelines in production. This platform is the backbone for other infra systems like Search, Espresso (internal document store) and feature management etc. We provide a rich authoring and testing environment which allows users to create, test, and deploy their streaming jobs in a self-serve fashion within minutes. Users can focus on their business logic, leaving the Flink platform to take care of management aspects such as split deployment, resource provisioning, auto-scaling, job monitoring, alerting, failure recovery and much more. In this talk, we will introduce the overall platform architecture, highlight the unique value propositions that it brings to stream processing at LinkedIn and share the experiences and lessons we have learned.

The Current State of Table API in 2022

Flink Forward

Flink Forward San Francisco 2022. The Table API is one of the most actively developed components of Flink in recent time. Inspired by databases and SQL, it encapsulates concepts many developers are familiar with. It can be used with both bounded and unbounded streams in a unified way. But from afar it can be difficult to keep track of what this API is capable of and how it relates to Flink's other APIs. In this talk, we will explore the current state of Table API. We will show how it can be used as a batch processor, a changelog processor, or a streaming ETL tool with many built-in functions and operators for deduplicating, joining, and aggregating data. By comparing it to the DataStream API we will highlight differences and elaborate on when to use which API. We will demonstrate hybrid pipelines in which both APIs interact with one another and contribute their unique strengths. Finally, we will take a look at some of the most recent additions as a first step to stateful upgrades. by David Andreson

Improving Kafka at-least-once performance at Uber

Ying Zheng

At Uber, we are seeing an increasing demand for Kafka at-least-once delivery (asks=all). So far, we are running a dedicated at-least-once Kafka cluster with special settings. With a very low workload, the dedicated at-least-once cluster has been working well for more than a year. When trying to allow at-least-once producing on the regular Kafka clusters, the producing performance was the main concern. We spent some effort on this issue in the recent months, and managed to reduce at-least-once producer latency by about 80% with code changes and configuration tuning. When acks=0, these improvements also help increasing Kafka throughput and reducing Kafka end-to-end latency.

Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...

Flink Forward

Flink Forward San Francisco 2022. Flink consumers read from Kafka as a scalable, high throughput, and low latency data source. However, there are challenges in scaling out data streams where migration and multiple Kafka clusters are required. Thus, we introduced a new Kafka source to read sharded data across multiple Kafka clusters in a way that conforms well with elastic, dynamic, and reliable infrastructure. In this presentation, we will present the source design and how the solution increases application availability while reducing maintenance toil. Furthermore, we will describe how we extended the existing KafkaSource to provide mechanisms to read logical streams located on multiple clusters, to dynamically adapt to infrastructure changes, and to perform transparent cluster migrations and failover. by Mason Chen

DBA Fundamentals Group: Continuous SQL with Kafka and Flink

Timothy Spann

DBA Fundamentals Group: Continuous SQL with Kafka and Flink 20-Feb-2024 In this talk, I will walk through how someone can set up and run continuous SQL queries against Kafka topics utilizing Apache Flink. We will walk through creating Kafka topics, schemas, and publishing data. We will then cover consuming Kafka data, joining Kafka topics, and inserting new events into Kafka topics as they arrive. This basic overview will show hands-on techniques, tips, and examples of how to do this. Tim Spann Tim Spann is the Principal Developer Advocate for Data in Motion @ Cloudera where he works with Apache Kafka, Apache Flink, Apache NiFi, Apache Iceberg, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a Developer Advocate at StreamNative, Principal Field Engineer at Cloudera, a Senior Solutions Architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, Data Works Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science.

Spark shuffle introduction

colorant

Building an Activity Feed with Cassandra

Mark Dunphy

Battle of the Stream Processing Titans – Flink versus RisingWave

Yingjun Wu

The world of real-time data processing is constantly evolving, with new technologies and platforms emerging to meet the ever-increasing demands of modern data-driven businesses. Apache Flink and RisingWave are two powerful stream processing solutions that have gained significant traction in recent years. But which platform is right for your organization? Karin Wolok and Yingjun Wu go head-to-head to compare and contrast the strengths and limitations of Flink and RisingWave. They’ll also share real-world use cases, best practices for optimizing performance and efficiency, and key considerations for selecting the right solution for your specific business needs.

Storing State Forever: Why It Can Be Good For Your Analytics

Yaroslav Tkachenko

State is an essential part of the modern streaming pipelines: it enables a variety of foundational capabilities like windowing, aggregation, enrichment, etc. But usually, the state is either transient, so we only keep it until the window is closed, or it's fairly small and doesn't grow much. But what if we treat the state differently? The keyed state in Flink can be scaled vertically and horizontally, it's reliable and fault-tolerant... so is scaling a stateful Flink application that different from scaling any data store like Kafka or MySQL? At Shopify, we've worked on a massive analytical data pipeline that's needed to support complex streaming joins and correctly handle arbitrarily late-arriving data. We came up with an idea to never clear state and support joins this way. We've made a successful proof of concept, ingested all historical transactional Shopify data and ended up storing more than 10 TB of Flink state. In the end, it allowed us to achieve 100% data correctness.

The top 3 challenges running multi-tenant Flink at scale

Flink Forward

Apache Flink is the foundation for Decodable's real-time SaaS data platform. Flink runs critical data processing jobs with strong security requirements. In addition, Decodable has to scale to thousands of tenants, power various use cases, provide an intuitive user experience and maintain cost-efficiency. We've learned a lot of lessons while building and maintaining the platform. In this talk, I'll share the top 3 toughest challenges building and operating this platform with Flink, and how we solved them.

Handle Large Messages In Apache Kafka

Jiangjie Qin

Like many other messaging systems, Kafka has put limit on the maximum message size. User will fail to produce a message if it is too large. This limit makes a lot of sense and people usually send to Kafka a reference link which refers to a large message stored somewhere else. However, in some scenarios, it would be good to be able to send messages through Kafka without external storage. At LinkedIn, we have a few use cases that can benefit from such feature. This talk covers our solution to send large message through Kafka without additional storage.

Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...

HostedbyConfluent

More and more Enterprises are relying on Apache Kafka to run their businesses. Cluster administrators need the ability to mirror data between clusters to provide high availability and disaster recovery. MirrorMaker 2, released recently as part of Kafka 2.4.0, allows you to mirror multiple clusters and create many replication topologies. Learn all about this awesome new tool and how to reliably and easily mirror clusters. We will first describe how MirrorMaker 2 works, including how it addresses all the shortcomings of MirrorMaker 1. We will also cover how to decide between its many deployment modes. Finally, we will share our experience running it in production as well as our tips and tricks to get a smooth ride.

Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf

Ververica

The need to enrich a fast, high volume data stream with slow-changing reference data is probably one of the most wide-spread requirements in stream processing applications. Apache Flink's built-in join functionalities and its flexible lower-level APIs support stream enrichment in various ways depending on the specific requirements of the use case at hand. In this webinar, I like to provide an overview of the basic methods to enrich a data stream with Apache Flink and highlight use cases, limitations, advantages and disadvantages of each.

Unified Stream and Batch Processing with Apache Flink

DataWorks Summit/Hadoop Summit

Apache Flink in the Cloud-Native Era

Flink Forward

Flink Forward San Francisco 2022. This talk will take you on the long journey of Apache Flink into the cloud-native era. It started all the way from where Hadoop and YARN were the standard way of deploying and operating data applications. We're going to deep dive into the cloud-native set of principles and how they map to the Apache Flink internals and recent improvements. We'll cover fast checkpointing, fault tolerance, resource elasticity, minimal infrastructure dependencies, industry-standard tooling, ease of deployment and declarative APIs. After this talk you'll get a broader understanding of the operational requirements for a modern streaming application and where the current limits are. by David Moravek

Robert Metzger - Connecting Apache Flink to the World - Reviewing the streami...

Flink Forward

http://flink-forward.org/kb_sessions/connecting-apache-flink-with-the-world-reviewing-the-streaming-connectors/ Getting data in and out of Flink in a reliable fashion is one of the most important tasks of a stream processor. This talk will review the most important and frequently used connectors in Flink. Apache Kafka and Amazon Kinesis Streams both fall into the same category of distributed, high-throughput and durable publish-subscribe messaging systems. The talk will explain how the connectors in Flink for these systems are implemented. In particular we’ll focus on how we ensure exactly-once semantics while consuming data and how offsets/sequence numbers are handled. We will also review two generic tools in Flink for connectors: A message acknowledging source for classical message queues (like those implementing AMQP) and a generic write ahead log sink, using Flink’s state backend abstraction. The objective of the talk is to explain the internals of the streaming connectors, so that people can understand their behavior, configure them properly and implement their own connectors.

Flow control

STEFFY D

What's hot

Fundamentals of Apache Kafka

Chhavi Parasher

Introducing the Apache Flink Kubernetes Operator

Flink Forward

Apache Kafka’s Transactions in the Wild! Developing an exactly-once KafkaSink...

HostedbyConfluent

A Day in the Life of a ClickHouse Query Webinar Slides

Altinity Ltd

kafka

Amikam Snir

Building a fully managed stream processing platform on Flink at scale for Lin...

Flink Forward

The Current State of Table API in 2022

Flink Forward

Improving Kafka at-least-once performance at Uber

Ying Zheng

Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...

Flink Forward

DBA Fundamentals Group: Continuous SQL with Kafka and Flink

Timothy Spann

Spark shuffle introduction

colorant

Building an Activity Feed with Cassandra

Mark Dunphy

Battle of the Stream Processing Titans – Flink versus RisingWave

Yingjun Wu

Storing State Forever: Why It Can Be Good For Your Analytics

Yaroslav Tkachenko

The top 3 challenges running multi-tenant Flink at scale

Flink Forward

Handle Large Messages In Apache Kafka

Jiangjie Qin

Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...

HostedbyConfluent

Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf

Ververica

Unified Stream and Batch Processing with Apache Flink

DataWorks Summit/Hadoop Summit

Apache Flink in the Cloud-Native Era

Flink Forward

What's hot (20)

Fundamentals of Apache Kafka

Introducing the Apache Flink Kubernetes Operator

Apache Kafka’s Transactions in the Wild! Developing an exactly-once KafkaSink...

A Day in the Life of a ClickHouse Query Webinar Slides

kafka

Building a fully managed stream processing platform on Flink at scale for Lin...

The Current State of Table API in 2022

Improving Kafka at-least-once performance at Uber

Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...

DBA Fundamentals Group: Continuous SQL with Kafka and Flink

Spark shuffle introduction

Building an Activity Feed with Cassandra

Battle of the Stream Processing Titans – Flink versus RisingWave

Storing State Forever: Why It Can Be Good For Your Analytics

The top 3 challenges running multi-tenant Flink at scale

Handle Large Messages In Apache Kafka

Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...

Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf

Unified Stream and Batch Processing with Apache Flink

Apache Flink in the Cloud-Native Era

Similar to Click-Through Example for Flink’s KafkaConsumer Checkpointing

Robert Metzger - Connecting Apache Flink to the World - Reviewing the streami...

Flink Forward

Flow control

STEFFY D

Stephan Ewen - Scaling to large State

Flink Forward

http://flink-forward.org/kb_sessions/scaling-stream-processing-with-apache-flink-to-very-large-state/ The majority of streaming programs is ‘stateful’: Windowed Aggregations, Sessions, Joins, Complex Event Processing, Tables – they all require to keep some form of state across individual events. With the migration of more and more complex batch jobs or data processing pipelines to streaming applications, some streaming programs need to keep terabytes of state. Apache Flink implements a checkpointing-based recovery mechanism that guarantees exactly-once semantics for state also in the presence of failures. The cost of checkpointing and recovery depends on the size of the program’s state. In this talk, we will discuss the current status of stateful processing in Apache Flink, as well as the ongoing efforts to make Flink’s fault tolerance mechanism scale to very large state sizes, supporting frequent checkpoints and faster recovery of large state, without requiring excessive numbers of machines.

Flow control

steffy D

Transport layer

steffy1996

CCDT(client connection)MQ.docx

sarvank2

Evolution of kube-proxy (Brussels, Fosdem 2020)

Laurent Bernaille

Kube-proxy enables access to Kubernetes services (virtual IPs backed by pods) by configuring client-side load-balancing on nodes. The first implementation relied on a userspace proxy which was not very performant. The second implementation used iptables and is still the one used in most Kubernetes clusters. Recently, the community introduced an alternative based on IPVS. This talk will start with a description of the different modes and how they work. It will then focus on the IPVS implementation, the improvements it brings, the issues we encountered and how we fixed them as well as the remaining challenges and how they could be addressed. Finally, the talk will present alternative solutions based on eBPF such as Cilium.

Exactly-Once Made Easy: Transactional Messaging Improvement for Usability and...

HostedbyConfluent

This talk is aimed to give developers who are interested to scale their streaming application with Exactly-Once (EOS) guarantees. Since the original release, EOS processing has received wide adoption as a much needed feature inside the community, and has also exposed various scalability and usability issues when applied in production systems. To address those issues, we improved on the existing EOS model by integrating static Producer transaction semantics with dynamic Consumer group semantics. We will have a deep-dive into the newly added features (KIP-447), from which the audience will have more insight into the scalability v.s. semantics guarantees tradeoffs and how Kafka Streams specifically leveraged them to help scale EOS streaming applications written in this library. We would also present how the EOS code can be simplified with plain Producer and Consumer. Come to learn more if you wish to adopt this improved EOS feature and get started on building your own EOS application today!

Exactly-Once Made Easy: Transactional Messaging Improvement for Usability and...

Guozhang Wang

Since the original release, EOS processing has received wide adoption as a much needed feature inside the community, and has also exposed various scalability and usability issues when applied in production systems. To address those issues, we improved on the existing EOS model by integrating static Producer transaction semantics with dynamic Consumer group semantics. We will have a deep-dive into the newly added features (KIP-447), from which the audience will have more insight into the scalability v.s. semantics guarantees tradeoffs and how Kafka Streams specifically leveraged them to help scale EOS streaming applications written in this library.

5-LEC- 5.pptxTransport Layer. Transport Layer Protocols

ZahouAmel1

Transport Layer. Transport Layer Protocols

Lecture 5

ntpc08

Exactly-Once Semantics Revisited: Distributed Transactions across Flink and K...

HostedbyConfluent

"Apache Flink’s Exactly-Once Semantics (EOS) integration for writing to Apache Kafka has several pitfalls, due mostly to the fact that the Kafka transaction protocol was not originally designed with distributed transactions in mind. The integration uses Java reflection hacks as a workaround, and the solution can still result in data loss under certain scenarios. Can we do better? In this session, you’ll see how the Flink and Kafka communities are uniting to tackle these long-standing technical debts. We’ll introduce the basics of how Flink achieves EOS with external systems and explore the common hurdles that are encountered when implementing distributed transactions. Then we’ll dive into the details of the proposed changes to both the Kafka transaction protocol and Flink transaction coordination that seek to provide a more robust integration. By the end of the talk, you’ll know the unique challenges of EOS with Flink and Kafka and the improvements you can expect across both projects."

More Data, More Problems: Scaling Kafka Mirroring Pipelines at LinkedIn

Celia Kung

For several years, LinkedIn has been using Kafka MirrorMaker as the mirroring solution for copying data between Kafka clusters across data centers. However, as LinkedIn data continued to grow, mirroring trillions of Kafka messages per day across data centers uncovered the scale limitations and operability challenges of Kafka MirrorMaker. To address these, we have developed a new mirroring solution, built on top of our stream ingestion service, Brooklin. Brooklin’s mirroring solution aims to provide improved performance and stability, while facilitating better management via finer control of data pipelines. Through flushless Kafka produce, dynamic management of data pipelines, per-partition error handling and flow control, we are able to increase throughput, better withstand consume and produce failures and reduce overall operating costs. As a result, we have eliminated the major pain points of Kafka MirrorMaker. In this talk, we will dive deeper into the challenges LinkedIn has faced with Kafka MirrorMaker, how we tackled them with Brooklin and our plans for iterating further on this new mirroring solution.

GEC21_DataPlanePerformanceCharacterization

Long Tran

Business Continuity and Load Balancing

MarioMastrodicasa

These slides are a revised version of the presentation submitted as a proposal for Kafka Summit. The submitted abstract follows. Business continuity and load balancing are key features in modern manufacturing and transportation environments. Managing real-time, scalable, distributed systems is a challenge for every software architect. In the presentation, we demonstrate how Apache Kafka helped us manage real-time load balancing and business continuity across multiple datacenters. Our platform is entirely based on message exchange between interconnected plugins: external I/O and internal plugins communicate through a message-based interface. Apache Kafka follows a similar logic: implementing our platform's message flow on top of Apache Kafka gave us the possibility to distribute the messages exchanged internally between plugins across all datacenters involved. The internal queue becomes a “Kafka multicast message queue” with multiple producers and consumers. By configuring which plugin interconnections use a “Kafka queue”, and which producers and consumers are active, it is possible to obtain both business continuity and load balancing. Manufacturing and transportation environments communicate with sensors and actuators in the field: it is mandatory that every event is processed and no event is lost. In most cases sensors and actuators are redundant, but only one can be active at a time. To achieve this, the platform has a global orchestrator in charge of arbitrating all plugins across all datacenters: using the ZooKeeper distributed lock recipe, the orchestrator manages a global lock and, by modifying the active state of producers and consumers, continually selects which plugin is in charge of executing the work. The orchestrator also has a global “objective function” that tries to optimize the balance of load between datacenters while ensuring business continuity in case of a fault or disaster.
Because our platform is based on OPC-UA technology, Apache Kafka also helped us implement transparent redundant behavior of OPC-UA servers using the same technique.

Os3

issbp

SystemC Ports

敬倫林

Cisco Openflow

Vijayaguru Jayaram

Arq protocol part 2

Aanandha Saravanan

Flow Control and Error Control

Minhazul Abedin Munna
