Logs are one of the most important sources to monitor and can reveal significant events of interest. In this presentation, we introduce an implementation of a log stream processing architecture based on Apache Flink. Different kinds of emitted logs are collected with fluentd and sent to Kafka. After the logs have been processed by Flink, we build a dashboard with Elasticsearch and Kibana for visualization.
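A minimal sketch of such a pipeline, assuming the legacy FlinkKafkaConsumer connector and a hypothetical `logs` topic; the Elasticsearch sink is left as a placeholder since its builder API differs across Flink versions.

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class LogPipelineSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Kafka topic populated by fluentd (topic name and brokers are assumptions)
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");
        props.setProperty("group.id", "log-processor");

        DataStream<String> rawLogs = env.addSource(
                new FlinkKafkaConsumer<>("logs", new SimpleStringSchema(), props));

        // Example processing step: keep only error-level log lines
        DataStream<String> errors = rawLogs.filter(line -> line.contains("ERROR"));

        // In the described architecture an Elasticsearch sink would follow here,
        // feeding the Kibana dashboard; print() stands in for it in this sketch.
        errors.print();

        env.execute("log-stream-processing-sketch");
    }
}
```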
Flink Connector Development Tips & Tricks (Eron Wright)
A look at some of the challenges and techniques for developing a connector for Apache Flink, covering the different types of connectors, lifecycle, metrics, event-time support, and fault tolerance.
Presentation video: https://www.youtube.com/watch?v=ZkbYO5S4z18
Virtual Flink Forward 2020: Build your next-generation stream platform based ... (Flink Forward)
As organizations are getting better at capturing streaming data and the data velocity and volume are ever-increasing, traditional messaging queues or log storage systems suffer from scalability, operational, and maintenance problems. Apache Pulsar is a multi-tenant, high-performance distributed pub-sub messaging system. Pulsar includes multiple features, such as native support for multiple clusters in a Pulsar instance, seamless geo-replication of messages across clusters, very low publishing and end-to-end latency, seamless scalability to over a million topics, and guaranteed message delivery with persistent message storage provided by Apache BookKeeper. In this talk, I will use one of the most popular stream processing engines, Apache Flink, as an example, to share our experience in building a stream processing and storage stack. Some of the topics covered are: * How to ensure end-to-end exactly-once semantics based on Pulsar's durable and replayable storage as well as Pulsar transactions. * How to implement Pulsar topics as infinite tables based on Pulsar's schema. * How to efficiently store stream states in Pulsar based on Pulsar's layered storage API. * A usage scenario that chains all of these functionalities together in the streaming platform.
As more and more organizations and individual users turn to Apache Flink for their streaming workloads, there is a bigger demand for additional functionality out of the box. On one hand, there is demand for more low-level APIs that allow for more control, while on the other, users ask for more high-level additions that make the common cases easier to express. This talk will present the new concepts added to the DataStream API in Flink 1.2 and the upcoming Flink 1.3 release that aim to reconcile the aforementioned goals. We will talk, among others, about the ProcessFunction, a new low-level stream processing primitive that gives the user full control over how each event is processed and can register and react to timers, changes in the windowing logic that allow for more flexible windowing strategies, side outputs, and new features concerning the Flink connectors.
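As an illustration of the primitive described above, here is a minimal sketch of a keyed process function that registers an event-time timer per element and reacts to it; it uses the later KeyedProcessFunction variant for brevity, and the event type and 60-second timeout are assumptions.

```java
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Emits a "timeout" marker if no further timer fires for a key within 60s of an event (event time).
public class TimeoutMarker extends KeyedProcessFunction<String, String, String> {

    private static final long TIMEOUT_MS = 60_000L;

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) {
        // Forward the event and register a timer relative to its timestamp.
        out.collect(value);
        Long ts = ctx.timestamp();
        if (ts != null) {
            ctx.timerService().registerEventTimeTimer(ts + TIMEOUT_MS);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) {
        // Fires when the watermark passes the registered timer.
        out.collect("timeout for key " + ctx.getCurrentKey() + " at " + timestamp);
    }
}
```

It would be applied on a keyed stream, e.g. `stream.keyBy(...).process(new TimeoutMarker())`.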
Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco... (Flink Forward)
Stateful stream processing with exactly-once guarantees is one of Apache Flink's distinctive features, and we can observe that the scale of state managed by Flink in production grows constantly. This leads to a couple of interesting challenges for state handling in Flink. In this talk, we present current and future developments to improve the handling of large state and recovery in Apache Flink. We show how to keep snapshots of large state swift and how to minimize negative effects on job performance through incremental and asynchronous checkpointing. Furthermore, we discuss how to greatly accelerate recovery under failures and for rescaling. In this context, we go into details about improved execution graph recovery, caching state on task managers, and considering new features of modern storage architectures for our state backends.
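As a rough illustration of the incremental-checkpointing feature mentioned above, this sketch enables the RocksDB state backend with incremental snapshots using the older, pre-1.13 API; the checkpoint URI and interval are assumptions, and newer Flink versions would use EmbeddedRocksDBStateBackend plus a checkpoint storage setting instead.

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IncrementalCheckpointSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every minute.
        env.enableCheckpointing(60_000L);

        // RocksDB keeps state out of the JVM heap; 'true' turns on incremental checkpoints,
        // so only changed SST files are uploaded instead of the full state on every snapshot.
        env.setStateBackend(
                new RocksDBStateBackend("hdfs:///flink/checkpoints", true));

        // ... build and execute the job as usual ...
    }
}
```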
Flink Forward SF 2017: Jamie Grier - Apache Flink - The latest and greatest (Flink Forward)
Jamie Grier outlines the latest important features in Apache Flink and walks you through building a working demo to show these features off. Topics may include queryable state, dynamic scaling, streaming SQL, very large state support, and whatever is the latest and greatest in March 2017.
Flink Forward Berlin 2017: Dominik Bruhn - Deploying Flink Jobs as Docker Con... (Flink Forward)
This talk will focus on how to package, distribute and deploy Flink jobs by leveraging existing Docker technology: previously, deploying Flink jobs has been a manual task that easily leads to errors. In this talk, we present an approach which works well in a CI/CD environment by automating most steps: from the code of a Flink job in a repository to a running job on a YARN cluster.
Flink Forward SF 2017: Stephan Ewen - Convergence of real-time analytics and ... (Flink Forward)
Witnessing the rise of stream processing from the driving seat, we see Apache Flink® and associated technologies used for a wide variety of business applications, from routing data through systems, serving as a backbone for real-time analytics on live data using SQL, detecting credit card fraud, to implementing complete end-to-end social networks. Such applications enable modern data-driven businesses where decisions and actions happen in real-time, and transform traditional businesses to become more data-driven. Observing the variety of these applications implemented using Flink, it becomes apparent that the traditional dividing line between analytics and operational applications is becoming more and more blurry. Historically, operational applications were built using transactional databases, and analytics were done offline. In contrast, Flink's state, checkpoints, and time management are the core building blocks for both operational applications with strong data consistency needs, and for real-time analytics with correctness guarantees. With these shared building blocks, developers start building what is arguably a new class of data-driven applications: applications that are operational in that they serve live systems and at the same time analytical in that they perform complex data analysis. Following application architectures like CQRS and using new features like Flink's queryable state, streaming analytics and online applications move even closer to each other. In this talk, guided by real-world use cases, we present how the unique core concepts behind Flink simplify the development, deployment, and management of data-driven applications, and we conclude with a vision for the future for Flink and stream processing.
Francesco Versaci - Flink in genomics - efficient and scalable processing of ... (Flink Forward)
http://flink-forward.org/kb_sessions/flink-in-genomics-efficient-and-scalable-processing-of-raw-illumina-bcl-data/
A single run in genome sequencing can easily produce several terabytes of data, which subsequently feed a complex pipeline of tools. Typically, the first step in this workflow is a rearrangement of data, roughly equivalent to a matrix transposition, to reconstruct the original DNA fragments from the raw BCL data, where the fragments are sliced and scattered over multiple files. This step is followed by the sorting of the fragments by a specific identifying tag sequence, which is attached during the preparation of the sample. In this talk we will present a parallel program which performs these essential operations. Our BCL converter is shown to have comparable performance to the shared-memory Illumina bcl2fastq tool, while also enabling easy and scalable distributed-memory parallelization. We will describe the techniques we have used to achieve high performance and discuss the features of Flink which we have particularly appreciated as well as the ones which we think are still missing.
Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large... (Flink Forward)
This talk shares experiences from deploying and tuning Flink stream processing applications for very large scale. We share lessons learned from users, contributors, and our own experiments about running demanding streaming jobs at scale. The talk will explain what aspects currently render a job as particularly demanding, show how to configure and tune a large-scale Flink job, and outline what the Flink community is working on to make the out-of-the-box experience as smooth as possible. We will, for example, dive into - analyzing and tuning checkpointing - selecting and configuring state backends - understanding common bottlenecks - understanding and configuring network parameters
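To make the checkpoint-tuning points above concrete, here is a small sketch of the knobs typically adjusted for demanding jobs; the interval, timeout, and pause values are placeholders, not recommendations.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuningSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 2 minutes with exactly-once guarantees.
        env.enableCheckpointing(120_000L, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig cc = env.getCheckpointConfig();
        // Leave the job room to make progress between checkpoints.
        cc.setMinPauseBetweenCheckpoints(30_000L);
        // Abort a checkpoint attempt that takes longer than 10 minutes.
        cc.setCheckpointTimeout(600_000L);
        // Do not pile up concurrent checkpoints on a slow backend.
        cc.setMaxConcurrentCheckpoints(1);
    }
}
```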
Stream Loops on Flink - Reinventing the wheel for the streaming era (Paris Carbone)
This document discusses adding iterative processing capabilities to stream processing systems like Apache Flink. It proposes programming model extensions that treat iterative computations as structured loops over windows. Progress would be tracked using progress timestamps rather than watermarks to allow for arbitrary loop structures. Challenges include managing state and cyclic flow control to avoid deadlocks while encouraging iteration completion.
Keystone Data Pipeline manages several thousand Flink pipelines, with variable workloads. These pipelines are simple routers which consume from Kafka and write to one of three sinks. In order to alleviate our operational overhead, we've implemented autoscaling for our routers. Autoscaling has reduced our resource usage by 25% - 45% (varying by region and time), and has reduced our on-call burden. This talk will take an in-depth look at the mathematics, algorithms, and infrastructure details for implementing autoscaling of simple pipelines at scale. It will also discuss future work for autoscaling complex pipelines.
2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo... (Ververica)
Learn how the combination of Apache Kafka and Apache Flink is making stateful stream processing even more expressive and flexible to support applications in streaming that were previously not considered streamable.
The new world of applications and fast data architectures has broken up the database: Raw data persistence comes in the form of event logs, and the state of the world is computed by a stream processor. Apache Kafka provides a strong solution for the event log, while Apache Flink forms a powerful foundation for the computation over the event streams.
In this talk we discuss how Flink’s abstraction and management of application state have evolved over time and how Flink’s snapshot persistence model and Kafka’s log work together to form a base to build ‘versioned applications’. We will also show how end-to-end exactly-once processing works through a smart integration of Kafka’s transactions and Flink’s checkpointing mechanism.
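A minimal sketch of the sink side of that exactly-once integration, assuming the (now legacy) FlinkKafkaProducer with transactional semantics; the topic name, brokers, and timeout are placeholders.

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;

import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ExactlyOnceKafkaSinkSketch {
    public static FlinkKafkaProducer<String> createSink() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");
        // The transaction timeout must comfortably exceed the checkpoint interval.
        props.setProperty("transaction.timeout.ms", "900000");

        KafkaSerializationSchema<String> schema = (element, timestamp) ->
                new ProducerRecord<>("output-topic", element.getBytes(StandardCharsets.UTF_8));

        // EXACTLY_ONCE ties Kafka transactions to Flink checkpoints: records become
        // visible to read-committed consumers only when the enclosing checkpoint completes.
        return new FlinkKafkaProducer<>(
                "output-topic", schema, props, FlinkKafkaProducer.Semantic.EXACTLY_ONCE);
    }
}
```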
Flink Forward Berlin 2017: Matt Zimmer - Custom, Complex Windows at Scale Usi... (Flink Forward)
The windowing capabilities offered by most stream processing engines are limited to aligned windows of a fixed duration. However, many real-world event processing use cases don’t fit this rigid structure, resulting in awkward processing pipelines. There haven’t been good alternatives, until recently that is. Apache Flink offers a rich Window API that supports implementing unaligned windows of varying duration. In this talk, Matt Zimmer will discuss using this API at Netflix to aggregate events into windows customized along varying definitions of a session. He will talk about implementation details such as: * Handling out-of-order events * Limiting state build-up while aggregating a subset of events from an event stream * Periodically emitting early results * Creating windows bounded by a type of event Attendees will leave this talk with practical techniques and knowledge to implement their own custom windows in Apache Flink.
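A small sketch of one of the patterns discussed: event-time session windows with a gap-based session definition, here counting events per user with an assumed 30-minute inactivity gap and an assumed (userId, count) input shape.

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SessionWindowSketch {
    // Input: (userId, 1L) pairs with event-time timestamps already assigned upstream.
    public static DataStream<Tuple2<String, Long>> eventsPerSession(
            DataStream<Tuple2<String, Long>> events) {
        return events
                .keyBy(e -> e.f0)
                // A session ends after 30 minutes without activity for that key.
                .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
                // Sum the counts within each session window.
                .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));
    }
}
```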
In this talk, we describe the design and implementation of the Python Streaming API support that has been submitted for inclusion in mainline Flink. Python is one of the most popular programming languages for data analysis. Its readability emphasizes development productivity, and as a scripting language it requires neither compilation nor a complex development environment setup. Flink already has support for Python APIs for batch programming; unfortunately, the mechanism used to support batch programs (i.e., the DataSet API) does not work for the Streaming API. We describe the limitations of the batch implementation and provide insights into how we solved this using Jython. We will walk through some example programs using the new Python API and compare programmability and performance with the Java and Scala streaming APIs.
Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'... (Ververica)
Stream Processing is emerging as a popular paradigm for data processing architectures, because it handles the continuous nature of most data and computation and gets rid of artificial boundaries and delays.
The fact that stream processing is gaining rapid adoption is also due to more powerful and maturing technology (much of it open source at the ASF) that has solved many of the hard technical challenges.
We discuss Apache Flink's approach to high performance stream processing with state, strong consistency, low latency, and sophisticated handling of time. With such building blocks, Apache Flink can handle classes of problems previously considered out of reach for stream processing. We also take a sneak peek at the next steps for Flink.
Flink Forward Berlin 2017: Fabian Hueske - Using Stream and Batch Processing ... (Flink Forward)
Apache Flink's DataStream API is very expressive and gives users precise control over time and state. However, many applications do not require this level of expressiveness and can be implemented more concisely and easily with a domain-specific API. SQL is undoubtedly the most widely used language for data processing but usually applied in the domain of batch processing. Apache Flink features two relational APIs for unified stream and batch processing, the Table API, a language-integrated relational query API for Scala and Java, and SQL. A Table API or SQL query computes the same result regardless of whether it is evaluated on a static file or on a Kafka topic. While Flink evaluates queries on batch input like a conventional query engine, queries on streaming input are continuously processed and their results constantly updated and refined. In this talk we present Flink's unified relational APIs, show how streaming SQL queries are processed, and discuss exciting new use-cases.
Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache Flink (Ververica)
As Apache Flink continues to push the boundaries of stateful stream processing as an integral part of its past releases, increasing numbers of users are starting to realize the potential of stateful stream processing as a promising paradigm for robust and reactive data analytics as well as event-driven applications.
This talk aims at covering the general idea and motivations of stateful stream processing, and how Flink enables it with its powerful set of state management features and programming APIs. In addition to that, we will also take a look at the recent advancements related to Flink's state management and large state handling that were driven by our team at data Artisans in the latest version 1.3 (expected release by end of May / early June).
This document discusses the C100K problem of handling 100,000 concurrent network connections efficiently and describes how the Go programming language solves this problem. It explains that Go uses lightweight goroutines instead of OS threads, has a fast scheduler, and uses non-blocking I/O with epoll to efficiently handle a large number of clients with a small memory footprint on each CPU core. An example TCP/HTTP server is shown to demonstrate how Go implements networking.
Flink Forward Berlin 2017: Patrick Lucas - Flink in Containerland (Flink Forward)
Apache Flink, a powerful distributed stateful stream processing framework, is an especially good fit for deployment on a containerization platform: its storage requirement is primarily external (e.g. HDFS or S3), clusters often share the lifetime of the jobs that run on them, and the flexibility of allocating resources on such a platform allows for scaling jobs up and down as necessary. In this talk I will give a brief introduction to Apache Flink, then describe the journey to making it a first-class citizen of the container world. I will cover my experience preparing to publish the “official repository” of Flink images on Docker Hub, the challenges of fitting a Flink deployment in a Kubernetes-shaped box, and the rough edges of Flink itself that were exposed by this process.
This document provides an overview of Apache Flink and stream processing. It discusses how stream processing has changed data infrastructure by enabling real-time analysis with low latency. Traditional batch processing had limitations like high latency of hours. Flink allows analyzing streaming data with sub-second latency using mechanisms like windows, state handling, and fault tolerance through distributed snapshots. The document benchmarks Flink performance against other frameworks on a Yahoo! production use case, finding Flink can achieve over 15 million messages/second throughput.
What's new in 1.9.0 blink planner - Kurt Young, Alibaba (Flink Forward)
Flink 1.9.0 added the ability to support multiple SQL planners under the same API. With this, we successfully merged many features that come from Alibaba's internal Flink version, called Blink. In this talk, I will give an introduction to the architecture of the Blink planner and share the functionality and performance enhancements we added.
Flink Forward SF 2017: Ufuk Celebi - The Stream Processor as a Database: Buil... (Flink Forward)
We present a new design pattern for data streaming applications, using Apache Flink and Apache Kafka: Building applications directly on top of the stream processor, rather than on top of key/value databases populated by data streams. Unlike classical setups that use stream processors or libraries to pre-process/aggregate events and update a database with the results, this setup simply gives the role of the database to the stream processor (here Apache Flink), routing queries to its workers who directly answer them from their internal state computed over the log of events (Apache Kafka). This talk will cover both the high-level introduction to the architecture, the techniques in Flink/Kafka that make this approach possible, as well as a live demo.
Flink Forward SF 2017: Cliff Resnick & Seth Wiesman - From Zero to Streami... (Flink Forward)
Apache Flink provides powerful stream processing capabilities which can allow organizations to move directly from batch to real time analytics, skipping the lambda architecture entirely. However, getting to production is not always as simple as rewriting your job in a new API, but requires rethinking your application design with a stream first mindset. This talk will cover MediaMath’s journey in rebuilding its reporting infrastructure using Apache Flink. We will discuss high level architectural designs when building an extensible reporting platform as well as deep dive into specific technical hurdles. Topics will include managing a Flink cluster on EC2 spot instances, reconciling Flink’s consistency model with S3’s, handling massive data skew as well as tools and techniques for building performant, fault tolerant streaming applications.
Flink Forward Berlin 2017: Andreas Kunft - Efficiently executing R Dataframes... (Flink Forward)
This document discusses providing an R dataframe abstraction for efficient distributed computation on Apache Flink. The goals are to provide a natural API for R and achieve performance comparable to Flink's native dataflow. The approach represents R dataframes as Flink data sets and compiles R functions into the native execution plan where possible. For user-defined R functions, they are evaluated within worker tasks using a just-in-time compiler. This allows executing R code within the same Java virtual machine as Flink for good performance, even on a single node. Results show it can achieve native Flink performance even for functions containing R code.
Flink 1.0 includes major features such as out-of-core state using RocksDB, savepoints for upgrading jobs and versions, and a CEP library for pattern detection. Savepoints allow taking snapshots of streaming jobs for code upgrades, Flink version upgrades, and testing through time travel. Flink 1.0 introduces backwards compatibility and pushes production streaming further through these features.
These are the slides that supported the presentation on Apache Flink at ApacheCon Budapest.
Apache Flink is a platform for efficient, distributed, general-purpose data processing.
Delivering User Behavior Analytics at Apache Hadoop Scale: A new perspective... (Cloudera, Inc.)
Learn how to:
* Detect threats automatically and accurately
* Reduce threat response times from 7 days to 4 hours
* Ingest and process 100+TB per day for automated machine learning and behavior-based detection
This document provides an overview of Apache Flink, an open-source platform for distributed stream and batch data processing. Flink allows for unified batch and stream processing with a simple yet powerful programming model. It features native stream processing, exactly-once fault tolerance based on consistent snapshots, and high performance optimized for streaming workloads. The document outlines Flink's APIs, state management, fault tolerance approach, and roadmap for continued improvements in 2015.
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time (Flink Forward)
This document discusses processing scientific mass spectrometry data in real-time using parallel and distributed computing techniques. It describes how a mass spectrometry experiment produces terabytes of data that currently takes over 24 hours to fully process. The document proposes using MapReduce and Apache Flink to parallelize the data processing across clusters to help speed it up towards real-time analysis. Initial tests show Flink can process the data 2-3 times faster than traditional Hadoop MapReduce. Finally, it discusses simulating real-time streaming of the data using Kafka and Flink Streaming to enable processing results within 10 seconds of the experiment completing.
Apache Flink Training: DataStream API Part 2 Advanced (Flink Forward)
Flink can handle many data types and provides a type system to identify types for serialization and comparisons. Composite types like Tuples and POJOs can be used and fields within them can define keys. Windows provide a way to perform aggregations over finite slices of infinite streams. Connected streams allow correlating and joining multiple streams. Stateful functions have access to local and partitioned state for stateful stream processing. Kafka integration allows consuming from and producing to Kafka topics.
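To make the "stateful functions with partitioned state" point concrete, here is a minimal sketch of keyed ValueState in a RichFlatMapFunction; the (userId, amount) input shape is an assumption.

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Keeps a running sum per key; must be applied on a stream keyed by the user id.
public class RunningSum extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    private transient ValueState<Long> sum;

    @Override
    public void open(Configuration parameters) {
        sum = getRuntimeContext().getState(
                new ValueStateDescriptor<>("running-sum", Long.class));
    }

    @Override
    public void flatMap(Tuple2<String, Long> value, Collector<Tuple2<String, Long>> out) throws Exception {
        Long current = sum.value();          // state is scoped to the current key
        long updated = (current == null ? 0L : current) + value.f1;
        sum.update(updated);
        out.collect(Tuple2.of(value.f0, updated));
    }
}
```

It would be used as `input.keyBy(t -> t.f0).flatMap(new RunningSum())`.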
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink (Flink Forward)
Suneel Marthi gave a talk about BigPetStore, a blueprint for Apache Flink applications that uses synthetic data generators. BigPetStore includes data generators, examples using tools like MapReduce, Spark and Flink to process the generated data, and tests for integration. It is used for templates, education, testing, demos and benchmarking. The talk outlined the history and components of BigPetStore and described upcoming work to expand it for Flink, including batch and table API examples and machine learning algorithms.
Capital One is a large consumer and commercial bank that wanted to improve its real-time monitoring of customer activity data to detect and resolve issues quickly. Its legacy solution was expensive, proprietary, and lacked real-time and advanced analytics capabilities. Capital One implemented a new solution using Apache Flink for its real-time stream processing abilities. Flink provided cost-effective, real-time event processing and advanced analytics on data streams to help meet Capital One's goals. It also aligned with the company's technology strategy of using open source solutions.
Timing is Everything: Understanding Event-Time Processing in Flink SQL (HostedbyConfluent)
"In the stream processing context, event-time processing means the events are processed based on when the events occurred, rather than when the events are observed (processing-time) in the system. Apache Flink has a powerful framework for event-time processing, which plays a pivotal role in ensuring temporal order and result accuracy.
In this talk, we will introduce Flink event-time semantics and demonstrate how watermarks as a means of handling late-arriving events are generated, propagated, and triggered using Flink SQL. We will explore operators such as window and join that are often used with event time processing, and how different configurations can impact the processing speed, cost and correctness.
Join us for this exploration where event-time theory meets practical SQL implementation, providing you with the tools to make informed decisions for making optimal trade-offs."
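A small sketch of the kind of query the talk discusses, assuming Flink SQL's WATERMARK DDL clause and the TUMBLE group window; the table, columns, and the datagen connector are made up for illustration.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class EventTimeSqlSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Declare an event-time attribute and a watermark that tolerates 5s of out-of-orderness.
        tEnv.executeSql(
                "CREATE TABLE clicks (" +
                "  user_id STRING," +
                "  ts TIMESTAMP(3)," +
                "  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND" +
                ") WITH (" +
                "  'connector' = 'datagen'" +
                ")");

        // Count clicks per user in 1-minute tumbling event-time windows.
        tEnv.executeSql(
                "SELECT user_id, TUMBLE_END(ts, INTERVAL '1' MINUTE) AS w_end, COUNT(*) AS c " +
                "FROM clicks " +
                "GROUP BY user_id, TUMBLE(ts, INTERVAL '1' MINUTE)").print();
    }
}
```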
This document provides an overview of Apache Flink, an open-source framework for distributed stream and batch data processing. It discusses key aspects of Flink including that it executes everything as data streams, supports iterative and cyclic data flows, allows mutable state in operators, and provides high availability and checkpointing of operator state. It also provides examples of using Flink's DataStream API to perform operations like hourly and daily tweet impression counts on a continuous stream of tweet data from Kafka.
Stream processing with Apache Flink - Maximilian Michels, Data Artisans (Evention)
Apache Flink is an open source platform for distributed stream and batch data processing. At its core, Flink is a streaming dataflow engine which provides data distribution, communication, and fault tolerance for distributed computations over data streams. On top of this core, APIs make it easy to develop distributed data analysis programs. Libraries for graph processing or machine learning provide convenient abstractions for solving large-scale problems. Apache Flink integrates with a multitude of other open source systems like Hadoop, databases, or message queues. Its streaming capabilities make it a perfect fit for traditional batch processing as well as state of the art stream processing.
Aljoscha Krettek - Apache Flink® and IoT: How Stateful Event-Time Processing ... (Ververica)
This document discusses Apache Flink and how it enables accurate analytics for Internet of Things (IoT) applications through stateful event-time stream processing. It begins by defining IoT and event-time stream processing, explaining that IoT data is continuously generated and has timestamps. It then discusses challenges like time mismatches between event time and processing time. The document also covers Flink's capabilities for stateful stream processing including failure handling through checkpoints, updating applications using savepoints, and high availability of the JobManager. It positions Flink as a stateful stream processor well-suited for IoT use cases.
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana... (Big Data Spain)
This document discusses Apache Flink for IoT event-time stream processing. It begins by introducing streaming architectures and Flink. It then discusses how IoT data has important properties like continuous data production and event timestamps that require event-time based processing. Examples are provided of companies like King and Bouygues Telecom using Flink for billions of events per day with challenges like out-of-order data and flexible windowing. Event-time processing in Flink is able to handle these challenges through features like watermarks.
Understanding time in structured streaming (datamantra)
This document discusses time abstractions in structured streaming. It introduces process time, event time, and ingestion time. It explains how to use the window API to apply windows over these different time abstractions. It also discusses handling late events using watermarks and implementing non-time based windows using custom state management and sessionization.
Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...) (Timo Walther)
Apache Flink is a distributed, stateful stream processor. It features exactly-once state consistency, sophisticated event-time support, high throughput and low latency processing, and APIs at different levels of abstraction (Java, Scala, SQL). In my talk, I'll give an introduction to Apache Flink, its features and discuss the use cases it solves. I'll explain why batch is just a special case of stream processing, how its community evolves Flink into a truly unified stream and batch processor and what this means for its users.
https://www.meetup.com/de-DE/Bangalore-Apache-Kafka-Group/events/265285812/
https://www.youtube.com/watch?v=Ych5bbmDIoA&list=PLvkUPePDi9sa27SG9eGNXH25cfUeo_WY9&index=2
Running Flink in Production: The good, The bad and The in Between - Lakshmi ... (Flink Forward)
The streaming platform team at Lyft has been running Flink jobs in production for more than a year now, powering critical use cases like improving pickup ETA accuracy, dynamic pricing, generating machine learning features for fraud detection, real-time analytics among many others. Broadly, the jobs fall into two abstraction layers: applications (Flink jobs that run on the native platform) and analytics (that leverage Dryft, Lyft’s fully managed data processing engine). This talk will give an overview of the platform architecture, deployment model and user experience. The talk will also dive deeper into some of the challenges and the lessons that were learnt, running Flink jobs at scale, specifically around scaling Flink connectors, dealing with event time skew (source synchronization) and highlight common patterns of problems observed across several Flink jobs. Finally, the talk will give insights into how we are re-architecting the streaming platform @ Lyft using a Kubernetes based deployment.
Introduction to Stateful Stream Processing with Apache Flink (Konstantinos Kloudas)
Kostas Kloudas presented on stateful stream processing with Apache Flink. He discussed how Flink handles state management, fault tolerance, and time semantics to allow for continuous and accurate processing of streaming data. Flink embeds local state with keyed streams, takes consistent snapshots of distributed state, and uses watermarks to process events in event time to produce correct results even for out-of-order data. This allows Flink to provide a robust stream processing engine that scales to large deployments.
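As a concrete illustration of the watermark mechanism mentioned above, here is a minimal sketch using the WatermarkStrategy API available since Flink 1.11; the event shape and the 5-second out-of-orderness bound are assumptions.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;

public class WatermarkSketch {
    // Input: (sensorId, epochMillis) events that may arrive up to 5 seconds out of order.
    public static DataStream<Tuple2<String, Long>> withEventTime(
            DataStream<Tuple2<String, Long>> events) {
        WatermarkStrategy<Tuple2<String, Long>> strategy =
                WatermarkStrategy
                        .<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                        // Extract the event timestamp from the second tuple field.
                        .withTimestampAssigner((event, previousTimestamp) -> event.f1);
        return events.assignTimestampsAndWatermarks(strategy);
    }
}
```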
When Streaming Needs Batch With Konstantin Knauf | Current 2022 (HostedbyConfluent)
A streaming application is started once and then continuously ingests endless, fairly steady streams of events. That's as far as the theory goes.
Unfortunately, reality is more complicated. Over time your application's ability to process large historical data sets robustly, efficiently and correctly will be critical:
- for exploratory data analysis during development
- for bootstrapping the initial state of an application
- for back-filling following an outage or bugfix
- for keeping up with bursty input streams
These scenarios call for batch processing techniques. Apache Flink is as streaming-first as it gets. Yet over the last releases, the community has invested significant resources into unifying stream and batch processing on all layers of the stack, from the scheduler to the APIs.
In this talk, I'll introduce Apache Flink's approach to unified stream and batch processing and discuss - by example - how these scenarios can already be addressed today and what might be possible in the future.
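To make the unification tangible: since Flink 1.12 the same DataStream program can run in batch execution mode, e.g. for back-filling from bounded input. A minimal sketch, with a toy bounded source standing in for real input:

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BackfillSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // BATCH mode uses blocking shuffles and runs the bounded job stage by stage;
        // the same pipeline code runs unchanged in STREAMING mode on live input.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        env.fromElements(3, 1, 4, 1, 5)
           .keyBy(x -> x % 2)      // group by odd/even as a stand-in for a real key
           .reduce(Integer::sum)   // aggregate per key
           .print();

        env.execute("backfill-sketch");
    }
}
```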
Getting Data In and Out of Flink - Understanding Flink and Its Connector Ecos... (HostedbyConfluent)
This document discusses Flink's connector ecosystem and how to get data in and out of Flink. It describes Flink's layered APIs, including Flink SQL, Table API, DataStream API, and ProcessFunction. It also covers Flink's source and sink interfaces, including the unified source and sink interfaces, hybrid sources, and watermark alignment. The document provides guidance on when to write custom connectors versus leveraging Flink's async I/O capabilities. It notes that Source- and SinkFunction are being deprecated and encourages contributing to existing connectors or building new ones.
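A brief sketch of the unified Source interface the talk points to (the FLIP-27-based KafkaSource builder that supersedes the SourceFunction-based consumer); brokers, topic, and group id are placeholders.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UnifiedKafkaSourceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("events")
                .setGroupId("demo")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // fromSource() replaces addSource(SourceFunction) for the new unified connectors.
        DataStream<String> events =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-events");

        events.print();
        env.execute("unified-source-sketch");
    }
}
```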
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1VhSzmy.
Robert Metzger provides an overview of the Apache Flink internals and its streaming-first philosophy, as well as the programming APIs. Filmed at qconlondon.com.
Robert Metzger is a PMC member at the Apache Flink project and a cofounder and software engineer at data Artisans. He is the author of many Flink components including the Kafka and YARN connectors.
On September 21st, we had the pleasure of hosting at our offices a Meetup given by our colleague Paco Guerrero on the Apache Flink platform.
"Apache Flink es una plataforma open source de procesamiento en tiempo real, que está en auge al ofrecer características de las que otras tecnologías con las que compite no disponen, sin impacto en su rendimiento. En esta formación introduciremos la filosofía y motor de procesamiento que hace a Flink tan especial y potente. También recorreremos los pilares básicos que confirman a Flink como la plataforma de streaming más prometedora actualmente"
The upcoming Apache Flink 0.10 release will include features such as high availability of the JobManager through Zookeeper, live monitoring of accumulators and metrics, improved event-time and windowing capabilities using watermarks, and exactly-once fault tolerance through distributed snapshots. A demo will also show how fault tolerance works to ensure state consistency during failures. More improvements are still being worked on for this release.
The need for gleaning answers from data in real time is moving from a nicety to a necessity. There are few options for analyzing the never-ending stream of unbounded data at scale. Let's compare and contrast the core principles and technologies of the different open source solutions available to help with this endeavor, and where processing engines need to evolve to solve processing needs at scale. These findings are based on the experience of continuing to build a scalable solution in the cloud to process over 700 billion events at Netflix, and how we are embarking on the next journey to evolve unbounded data processing engines.
Data Stream Processing with Apache FlinkFabian Hueske
This talk is an introduction to Stream Processing with Apache Flink. I gave this talk at the Madrid Apache Flink Meetup on February 25th, 2016.
The talk discusses Flink's features, shows its DataStream API, and explains the benefits of event-time stream processing. It gives an outlook on some features that will be added after the 1.0 release.
Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...Ververica
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Analytics
Thursday 17th, from 18:00 to 18:40, Theatre 19 - Keynote
In this talk I’ll give a very short introduction to stream processing in general and then dive into event-time based stream processing. I will outline how this is important for IoT applications and also why it is such a challenging topic. Afterwards we’ll look at some real-world IoT use cases that are enabled by the support for robust event-time based stream processing provided by Apache Flink™. We will especially focus on ease of use and on correctness of results in the face of errors.
In the first half of the talk we’ll cover the basics of stream processing. We will look at the differences between event-time based and processing-time based processing, and at stateful stream processing. Along the way, we’ll also highlight how the combination of these features is essential for robust stream processing in an IoT setting.
In the second part, we will look at how Flink solves some of the challenges that arise in event-time based processing and how that enables novel applications in the IoT space. We will do the latter by looking at a collection of real-world IoT use cases.
Some of the topics covered will be:
- Apache Flink
- Stateful Stream Processing
- Event Time vs. Processing Time Windowing
- Processing of out-of-order events
- IoT use cases
In this talk about Apache Flink we will touch on three main things: an introductory look at Flink, a look under the hood, and a demo.
* In the introduction we will briefly look at the history of Flink and then go on to the API and different use cases. Here we will also see how it can be deployed in practice and what some of the pitfalls in a cluster setting can be.
* In the second section we will look at the streaming execution engine that lies at the heart of Flink. Here we will see what makes it tick and also what distinguishes it from other approaches, such as the mini-batch execution model.
* In the final section we will see a live demo of a fault-tolerant streaming job that performs analysis of the Wikipedia edit stream.
Ufuk Celebi - PMC member at Apache Flink and co-founder and software engineer at data Artisans
5. 00 This session will be about ...
● Flink’s notion of time in streaming jobs
● How Watermarks support Event-Time Processing
● Flink’s fault-tolerant, exactly-once streaming semantics
● Flink’s distributed snapshot checkpointing
● Out-of-core streaming state backends
7. 01 Different Kinds of “Time”
● Processing Time:
○ The timestamp at which a system processes an event
○ “Wall Time”
● Ingestion Time:
○ The timestamp at which a system receives an event
○ “Wall Time”
● Event Time:
○ The timestamp at which an event is generated
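As a rough illustration (not from the slides), this is how a job opts into event-time semantics in the Flink 1.x Scala API of the time; note that setStreamTimeCharacteristic has since been deprecated in newer Flink versions:

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
// interpret timestamps embedded in the events, not the machine's wall clock
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)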
9. 02 Why Wall Time is Incorrect
● Think of a Twitter hash-tag count every 5 minutes
○ We would want the result to reflect the number of tweets actually tweeted in each 5-minute window
○ Not the number of tweet events the stream processor receives within 5 minutes
10. 02 Why Wall Time is Incorrect
● Think of replaying a Kafka topic on a windowed streaming application …
○ If you’re replaying a queue, windows are definitely wrong if you use a wall clock
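A minimal sketch of the hash-tag example in the Scala DataStream API (the input records and stream names below are made up for illustration); because the windows are driven by event time, replaying the same records yields the same counts:

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

// (hashtag, event timestamp in ms) -- a stand-in for a real tweet source
val tweets = env.fromElements(("#flink", 1000L), ("#flink", 2000L), ("#kafka", 3000L))

val counts = tweets
  .assignAscendingTimestamps(_._2)                       // use the embedded event timestamp
  .map(t => (t._1, 1))
  .keyBy(0)                                              // partition by hashtag
  .window(TumblingEventTimeWindows.of(Time.minutes(5)))  // 5-minute event-time windows
  .sum(1)                                                // tweets per hashtag per window

counts.print()
env.execute("hashtag-counts")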
11. 03 Watermarks & Event-Time
● Watermarks are Flink’s way of monitoring the progress of event time
● A watermark is essentially a record that flows within the data stream
● Watermarks carry a timestamp t; when a task receives a watermark with timestamp t, it knows that there will be no more events with timestamp t’ ≤ t
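For streams that arrive out of order, a common pattern (sketched below with a made-up SensorReading type and a hypothetical withEventTime helper) is a bounded-out-of-orderness extractor: the emitted watermark trails the largest timestamp seen so far by a fixed delay:

import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

case class SensorReading(id: String, timestamp: Long, value: Double)

// `readings` is assumed to come from some timestamped source
def withEventTime(readings: DataStream[SensorReading]): DataStream[SensorReading] =
  readings.assignTimestampsAndWatermarks(
    // watermark = max timestamp seen - 10 s, so events may be up to 10 s late
    new BoundedOutOfOrdernessTimestampExtractor[SensorReading](Time.seconds(10)) {
      override def extractTimestamp(r: SensorReading): Long = r.timestamp
    })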
16. 07 Stateful Streaming
● Any non-trivial streaming application is stateful
● To draw insights from a stream you usually need to look beyond a single record
● Any kind of aggregation is stateful (e.g. windows)
17. 08 What “state” looks like in Flink
● Any Flink task can be stateful
● State is partitioned with the streams that are read by stateful tasks
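As a hedged sketch of keyed state (the CountPerKey name is made up): a RichFlatMapFunction keeps a running count per key in a ValueState, which Flink partitions along with the keyed stream and includes in every checkpoint:

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

// emits (key, running count) and keeps the count in Flink-managed keyed state
class CountPerKey extends RichFlatMapFunction[(String, Int), (String, Long)] {
  private var count: ValueState[java.lang.Long] = _

  override def open(parameters: Configuration): Unit = {
    count = getRuntimeContext.getState(
      new ValueStateDescriptor[java.lang.Long]("count", classOf[java.lang.Long]))
  }

  override def flatMap(in: (String, Int), out: Collector[(String, Long)]): Unit = {
    val updated = Option(count.value()).map(_.longValue()).getOrElse(0L) + 1L
    count.update(updated)            // state is scoped to the current key automatically
    out.collect((in._1, updated))
  }
}

// usage on a keyed stream: stream.keyBy(0).flatMap(new CountPerKey)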
18. 09 Distributed Snapshots
● On each checkpoint trigger, TaskManagers tell all stateful tasks that they manage to snapshot their own state
● When complete, each task sends a checkpoint acknowledgement to the JobManager
● Based on the Chandy-Lamport distributed snapshot algorithm
19. 09 Distributed Snapshots
● On a checkpoint trigger by the JobManager, a checkpoint barrier is injected into the stream
20. 10 Distributed Snapshots
● When a task receives a checkpoint barrier, its state is checkpointed to a state backend
● A pointer to the stored state is stored in the distributed snapshot
21. 11 Distributed Snapshots
● After all stateful tasks acknowledge, the distributed snapshot is completed
● Only fully completed snapshots are used for restore on failure
22. 12 Checkpointing API
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.enableCheckpointing(100) // trigger a checkpoint every 100 ms
env.setStateBackend(new RocksDBStateBackend(...)) // constructor arguments (checkpoint URI, etc.) elided on the slide
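Building on the slide's snippet, a hypothetical tuning (not from the deck) using the same checkpointing API:

import org.apache.flink.streaming.api.CheckpointingMode

env.enableCheckpointing(60000, CheckpointingMode.EXACTLY_ONCE) // checkpoint every 60 s with exactly-once barriers
env.getCheckpointConfig.setMinPauseBetweenCheckpoints(30000)   // leave at least 30 s between checkpoints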
23. 13 Flink Streaming Savepoints
● Basically, a checkpoint that is persisted in the state backend
● Allows for stream progress “versioning”
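Operationally, this versioning is driven from Flink's command-line client; a minimal sketch, where the job id, savepoint path, and jar name are placeholders:

# take a savepoint of the running job
bin/flink savepoint <jobId>
# later: resume a (possibly fixed or rescaled) version of the job from it
bin/flink run -s <savepointPath> my-streaming-job.jar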
24. 14 Power of Savepoints
● No stateless point in time
25. 14 Power of Savepoints
● Reprocessing as batch
26. 14 Power of Savepoints
● Reprocessing as batch (corrupt state)
27. 14 Power of Savepoints
● Reprocessing as streaming, starting from savepoint
28. 15 Power of Savepoints
● Reprocessing as streaming, starting from savepoint