This document provides an overview of a presentation comparing Apache Flink and Apache Spark. The presentation aims to address marketing claims, confusing statements, and outdated information regarding Flink vs Spark. It outlines key criteria to evaluate the two platforms, such as streaming capabilities, state management, and scalability. The document then directly compares some criteria, such as their support for iterative processing and streaming engines. The presenter hopes this evaluation framework will help others assess Flink and Spark for stream processing use cases.
Stephan Ewen - Experiences running Flink at Very Large ScaleVerverica
This talk shares experiences from deploying and tuning Flink stream processing applications at very large scale. We share lessons learned from users, contributors, and our own experiments about running demanding streaming jobs at scale. The talk will explain which aspects currently make a job particularly demanding, show how to configure and tune a large-scale Flink job, and outline what the Flink community is working on to make the out-of-the-box experience as smooth as possible. We will, for example, dive into analyzing and tuning checkpointing, selecting and configuring state backends, understanding common bottlenecks, and understanding and configuring network parameters.
Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...Flink Forward
Stateful stream processing with exactly-once guarantees is one of Apache Flink's distinctive features, and we have observed that the scale of state managed by Flink in production is constantly growing. This development created new challenges for state management in Flink, in particular for state checkpointing, which is the core of Flink's fault tolerance mechanism. Two of the most important problems that we had to solve were the following: (i) how can we limit the duration and size of checkpoints to something that does not grow linearly with the size of the state, and (ii) how can we take checkpoints without blocking the processing pipeline in the meantime? We have implemented incremental checkpoints to solve the first problem by checkpointing only the changes between checkpoints, instead of always recording the whole state. Asynchronous checkpoints address the second problem and enable Flink to continue processing concurrently with running checkpoints. In this talk, we will take a deep dive into the details of Flink's new checkpointing features. In particular, we will talk about the underlying data structures, log-structured merge trees and copy-on-write hash tables, and how those building blocks are assembled and orchestrated to advance Flink's checkpointing.
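The idea of checkpointing only the changes since the previous checkpoint can be illustrated with a toy sketch. This is not Flink's implementation (the abstract points to log-structured merge trees and copy-on-write hash tables for that); the class `IncrementalStore` and the function `restore` are hypothetical names used only for illustration.

```python
class IncrementalStore:
    """Toy keyed state that tracks what changed since the last checkpoint."""

    def __init__(self):
        self.state = {}
        self.dirty = set()    # keys written since the last checkpoint
        self.deleted = set()  # keys removed since the last checkpoint

    def put(self, key, value):
        self.state[key] = value
        self.dirty.add(key)
        self.deleted.discard(key)

    def delete(self, key):
        if key in self.state:
            del self.state[key]
            self.dirty.discard(key)
            self.deleted.add(key)

    def checkpoint(self):
        # Record only the delta, so the cost follows the amount of change,
        # not the total size of the state.
        delta = {
            "upserts": {k: self.state[k] for k in self.dirty},
            "deletes": set(self.deleted),
        }
        self.dirty.clear()
        self.deleted.clear()
        return delta


def restore(deltas):
    """Rebuild the full state by replaying deltas in checkpoint order."""
    state = {}
    for delta in deltas:
        state.update(delta["upserts"])
        for key in delta["deletes"]:
            state.pop(key, None)
    return state
```

In this sketch, recovery replays every delta since the beginning. A real system would periodically consolidate deltas so that recovery time stays bounded.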
Building Reliable Lakehouses with Apache Flink and Delta LakeFlink Forward
Flink Forward San Francisco 2022.
Apache Flink and Delta Lake together allow you to build the foundation for your data lakehouses by ensuring the reliability of your concurrent streams from processing to the underlying cloud object-store. Together, the Flink/Delta Connector enables you to store data in Delta tables such that you harness Delta’s reliability by providing ACID transactions and scalability while maintaining Flink’s end-to-end exactly-once processing. This ensures that the data from Flink is written to Delta Tables in an idempotent manner such that even if the Flink pipeline is restarted from its checkpoint information, the pipeline will guarantee no data is lost or duplicated thus preserving the exactly-once semantics of Flink.
by
Scott Sandre & Denny Lee
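One way to picture an idempotent write that survives a restart from a checkpoint is a table that remembers the highest checkpoint ID each writer has committed and ignores replays. The sketch below is an illustration under that assumption, not the Flink/Delta Connector's actual code; `ToyTable`, `writer_id`, and `checkpoint_id` are hypothetical names.

```python
class ToyTable:
    """Toy table whose commits are idempotent per (writer, checkpoint)."""

    def __init__(self):
        self.rows = []
        # writer_id -> highest checkpoint id already committed by that writer
        self.last_committed = {}

    def commit(self, writer_id, checkpoint_id, rows):
        """Append rows unless this checkpoint was already committed.

        Returns True if the rows were written, False if the commit was
        recognized as a replay and skipped.
        """
        if checkpoint_id <= self.last_committed.get(writer_id, -1):
            return False
        self.rows.extend(rows)
        self.last_committed[writer_id] = checkpoint_id
        return True
```

If a pipeline restarts and re-sends the data belonging to an already committed checkpoint, the second `commit` call is a no-op, so the table contains each row once.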
Processing Semantically-Ordered Streams in Financial ServicesFlink Forward
Flink Forward San Francisco 2022.
What if my data is already in order? Stream Processing has given us an elegant and powerful solution for running analytic queries and logic over high volumes of continuously arriving data. However, in both Apache Flink and Apache Beam, the notion of time-ordering is baked in at a very low level, making it difficult to express computations that are interested in a semantic-, rather than time-ordering of the data. In financial services, what often matters the most about the data moving between systems is not when the data was created, but in what order, to the extent that many institutions engineer a global sequencing over all data entering and produced by their systems to achieve complete determinism. How, then, can financial institutions and others best employ Stream Processing on streams of data that are already ordered? I will cover various techniques that can make this work, as well as seek input from the community on how Flink might be improved to better support these use-cases.
by
Patrick Lucas
Running Apache Kafka in production is only the first step in the Kafka operations journey. Professional Kafka users are ready to handle all possible disasters - because for most businesses having a disaster recovery plan is not optional.
In this session, we’ll discuss disaster scenarios that can take down entire Kafka clusters and share advice on how to plan, prepare and handle these events. This is a technical session full of best practices - we want to make sure you are ready to handle the worst mayhem that nature and auditors can cause.
Visit www.confluent.io for more information.
http://flink-forward.org/kb_sessions/scaling-stream-processing-with-apache-flink-to-very-large-state/
The majority of streaming programs are ‘stateful’: Windowed Aggregations, Sessions, Joins, Complex Event Processing, Tables – they all need to keep some form of state across individual events. With the migration of more and more complex batch jobs or data processing pipelines to streaming applications, some streaming programs need to keep terabytes of state. Apache Flink implements a checkpointing-based recovery mechanism that guarantees exactly-once semantics for state even in the presence of failures. The cost of checkpointing and recovery depends on the size of the program’s state. In this talk, we will discuss the current status of stateful processing in Apache Flink, as well as the ongoing efforts to make Flink’s fault tolerance mechanism scale to very large state sizes, supporting frequent checkpoints and faster recovery of large state, without requiring excessive numbers of machines.
Evening out the uneven: dealing with skew in FlinkFlink Forward
Flink Forward San Francisco 2022.
When running Flink jobs, skew is a common problem that results in wasted resources and limited scalability. In the past years, we have helped our customers and users solve various skew-related issues in their Flink jobs or clusters. In this talk, we will present the different types of skew that users often run into: data skew, key skew, event time skew, state skew, and scheduling skew, and discuss solutions for each of them. We hope this will serve as a guideline to help you reduce skew in your Flink environment.
by
Jun Qin & Karl Friedrich
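The abstract does not say which remedies the talk proposes. As a generic illustration of one commonly used technique against key skew, the sketch below "salts" a hot key so that its records spread over several sub-keys, aggregates per sub-key, and then merges the partial results. The function names and the `num_salts` parameter are assumptions made for this example.

```python
from collections import Counter
import random


def salted_partial_counts(events, num_salts, rng):
    """First stage: count per (key, salt) so a hot key spreads over sub-keys."""
    partial = Counter()
    for key in events:
        salt = rng.randrange(num_salts)
        partial[(key, salt)] += 1
    return partial


def merge_partials(partial):
    """Second stage: drop the salt and combine the partial counts per key."""
    total = Counter()
    for (key, _salt), count in partial.items():
        total[key] += count
    return total
```

In a distributed job, the first stage would be partitioned by `(key, salt)` and the second by `key`, so no single worker has to process every record of the hot key.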
Advanced Stream Processing with Flink and Pulsar - Pulsar Summit NA 2021 KeynoteStreamNative
In this talk, Till Rohrmann and Addison Higham discuss how Flink allows for ambitious stream processing workflows and how Pulsar and Flink enable new capabilities that push forward the state-of-the-art in streaming. They will also share upcoming features and new capabilities in the integrations between Flink and Pulsar and how these two communities are working together to truly advance the power of stream processing.
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud (Hadoop / Spark Conference Japan 2019)
# English version #
http://hadoop.apache.jp/hcj2019-program/
Flink vs. Spark: this is the slide deck of my talk at the 2015 Flink Forward conference in Berlin, Germany, on October 12, 2015. In this talk, we tried to compare Apache Flink vs. Apache Spark with focus on real-time stream processing. Your feedback and comments are much appreciated.
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Flink Forward
Netflix’s playback data records every user interaction with video on the service, from trailers on the home page to full-length movies. This is a critical dataset with high volume that is used broadly across Netflix, powering product experiences, AB test metrics, and offline insights. In processing playback data, we depend heavily on event-time partitioning to handle a long tail of late arriving events. In this talk, I’ll provide an overview of our recent implementation of generic event-time partitioning on high volume streams using Apache Flink and Apache Iceberg (Incubating). Built as configurable Flink components that leverage Iceberg as a new output table format, we are now able to write playback data and other large scale datasets directly from a stream into a table partitioned on event time, replacing the common pattern of relying on a post-processing batch job that “puts the data in the right place”. We’ll talk through what it took to apply this to our playback data in practice, as well as challenges we hit along the way and tradeoffs with a streaming approach to event-time partitioning.
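Event-time partitioning means that the target partition is derived from the timestamp carried by the event, not from the time it arrives. The following sketch shows that idea with hourly partitions; it is not Netflix's implementation, and the field name `event_time_ms` and the partition layout are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_for(event_time_ms):
    """Map an event timestamp (epoch milliseconds) to an hourly partition."""
    dt = datetime.fromtimestamp(event_time_ms / 1000, tz=timezone.utc)
    return dt.strftime("%Y-%m-%d/%H")


def route(events):
    """Group events by the partition of their event time, in arrival order."""
    partitions = defaultdict(list)
    for event in events:
        partitions[partition_for(event["event_time_ms"])].append(event)
    return dict(partitions)
```

A late-arriving event still lands in the partition of the hour it happened in, which is what removes the need for a later batch job to move data into the right place.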
Best Practices for Middleware and Integration Architecture Modernization with...Claus Ibsen
What are important considerations when modernizing middleware and moving towards serverless and/or cloud native integration architectures? How can we make the most of flexible technologies such as Camel K, Kafka, Quarkus and OpenShift? Claus works as project lead on Apache Camel and has extensive experience in open source product development.
The talk was recorded, runs for 30 minutes, and is published on YouTube at: https://www.youtube.com/watch?v=d1Hr78a7Lww
Introduction to KSQL: Streaming SQL for Apache Kafka®confluent
Join Tom Green, Solution Engineer at Confluent for this Lunch and Learn talk covering KSQL. Confluent KSQL is the streaming SQL engine that enables real-time data processing against Apache Kafka®. It provides an easy-to-use, yet powerful interactive SQL interface for stream processing on Kafka, without the need to write code in a programming language such as Java or Python. KSQL is scalable, elastic, fault-tolerant, and it supports a wide range of streaming operations, including data filtering, transformations, aggregations, joins, windowing, and sessionization.
By attending one of these sessions, you will learn:
-How to query streams, using SQL, without writing code.
-How KSQL provides automated scalability and out-of-the-box high availability for streaming queries
-How KSQL can be used to join streams of data from different sources
-The differences between Streams and Tables in Apache Kafka
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
This presentation introduces Apache Flink, a massively parallel data processing engine which is currently undergoing the incubation process at the Apache Software Foundation. Flink's programming primitives are presented, and it is shown how easily a distributed PageRank algorithm can be implemented with Flink. Intriguing features such as dedicated memory management, Hadoop compatibility, streaming and automatic optimisation make it a unique system in the world of Big Data processing.
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
Flink Forward San Francisco 2022.
At Stripe we have created a complete end-to-end exactly-once processing pipeline to process financial data at scale, by combining the exactly-once power of Flink, Kafka, and Pinot. The pipeline provides an exactly-once guarantee, end-to-end latency within a minute, deduplication against hundreds of billions of keys, and sub-second query latency against the whole dataset with trillions of rows. In this session we will discuss the technical challenges of designing, optimizing, and operating the whole pipeline, including Flink, Kafka, and Pinot. We will also share our lessons learned and the benefits gained from exactly-once processing.
by
Xiang Zhang & Pratyush Sharma & Xiaoman Dong
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
Flink Forward San Francisco 2022.
Flink consumers read from Kafka as a scalable, high throughput, and low latency data source. However, there are challenges in scaling out data streams where migration and multiple Kafka clusters are required. Thus, we introduced a new Kafka source to read sharded data across multiple Kafka clusters in a way that conforms well with elastic, dynamic, and reliable infrastructure. In this presentation, we will present the source design and how the solution increases application availability while reducing maintenance toil. Furthermore, we will describe how we extended the existing KafkaSource to provide mechanisms to read logical streams located on multiple clusters, to dynamically adapt to infrastructure changes, and to perform transparent cluster migrations and failover.
by
Mason Chen
Apache Flink: Real-World Use Cases for Streaming AnalyticsSlim Baltagi
This face-to-face talk about Apache Flink in Sao Paulo, Brazil is the first event of its kind in Latin America! It explains how Apache Flink 1.0, announced on March 8th, 2016 by the Apache Software Foundation, marks a new era of Big Data analytics and in particular of real-time streaming analytics. The talk maps Flink's capabilities to real-world use cases that span multiple verticals such as Financial Services, Healthcare, Advertisement, Oil and Gas, Retail and Telecommunications.
In this talk, you will learn more about:
1. What is Apache Flink Stack?
2. Batch vs. Streaming Analytics
3. Key Differentiators of Apache Flink for Streaming Analytics
4. Real-World Use Cases with Flink for Streaming Analytics
5. Who is using Flink?
6. Where do you go from here?
Flink Forward San Francisco 2022.
Resource elasticity is a frequently requested feature in Apache Flink: users want to be able to easily adjust their clusters to changing workloads for resource efficiency and cost saving reasons. In Flink 1.13, the initial implementation of Reactive Mode was introduced; later releases added more improvements to make the feature production ready. In this talk, we’ll explain scenarios for deploying Reactive Mode in various environments to achieve autoscaling and resource elasticity. We’ll discuss the constraints to consider when planning to use this feature, and also potential improvements from the Flink roadmap. For those interested in the internals of Flink, we’ll also briefly explain how the feature is implemented and, if time permits, conclude with a short demo.
by
Robert Metzger
An Approach to Data Quality for Netflix Personalization SystemsDatabricks
Personalization is one of the key pillars of Netflix as it enables each member to experience the vast collection of content tailored to their interests.
Flink SQL & TableAPI in Large Scale Production at AlibabaDataWorks Summit
The search and recommendation systems for Alibaba’s e-commerce platform use batch and stream processing heavily. Flink SQL and Table API (which is a SQL-like DSL) provide a simple, flexible, and powerful language to express the data processing logic. More importantly, they open the door to unifying the semantics of batch and streaming jobs.
Blink is a project at Alibaba which improves Apache Flink to make it ready for large scale production use. To support our products, we made lots of improvements to Flink SQL & TableAPI in Alibaba's Blink project. We added support for User-Defined Table Functions (UDTF), User-Defined Aggregates (UDAGG), Window Aggregates, retraction, and more. We are actively working with the Flink community to contribute these improvements back. In this talk, we will present the rationale, semantics, design and implementation of these improvements. We will also share the experience of running large scale Flink SQL and TableAPI jobs at Alibaba.
Optimizing Delta/Parquet Data Lakes for Apache SparkDatabricks
This talk outlines data lake design patterns that can yield massive performance gains for all downstream consumers. We will talk about how to optimize Parquet data lakes and the awesome additional features provided by Databricks Delta.
* Optimal file sizes in a data lake
* File compaction to fix the small file problem
* Why Spark hates globbing S3 files
* Partitioning data lakes with partitionBy
* Parquet predicate pushdown filtering
* Limitations of Parquet data lakes (files aren't mutable!)
* Mutating Delta lakes
* Data skipping with Delta ZORDER indexes
Speaker: Matthew Powers
A 2-hour session where I cover what Apache Camel is and the latest news on the upcoming Camel v3. The main topic of the talk is the new Camel K sub-project for running integrations natively on the cloud with Kubernetes. The last part of the talk is about running Camel with GraalVM / Quarkus to achieve natively compiled binaries that have impressive startup time and footprint.
Flink powered stream processing platform at PinterestFlink Forward
Flink Forward San Francisco 2022.
Pinterest is a visual discovery engine that serves over 433MM users. Stream processing allows us to unlock value from realtime data for pinners. At Pinterest, we adopted Flink as the unified stream processing engine. In this talk, we will share our journey in building a stream processing platform with Flink and how we onboard critical use cases to the platform. Pinterest has supported 90+ near-realtime streaming applications. We will cover the problem statement, how we evaluated potential solutions, and our decision to build the framework.
by
Rainie Li & Kanchi Masalia
Evening out the uneven: dealing with skew in FlinkFlink Forward
Flink Forward San Francisco 2022.
When running Flink jobs, skew is a common problem that results in wasted resources and limited scalability. In the past years, we have helped our customers and users solve various skew-related issues in their Flink jobs or clusters. In this talk, we will present the different types of skew that users often run into: data skew, key skew, event time skew, state skew, and scheduling skew, and discuss solutions for each of them. We hope this will serve as a guideline to help you reduce skew in your Flink environment.
by
Jun Qin & Karl Friedrich
Advanced Stream Processing with Flink and Pulsar - Pulsar Summit NA 2021 KeynoteStreamNative
In this talk, Till Rohrmann and Addison Higham discuss how Flink allows for ambitious stream processing workflows and how Pulsar and Flink enable new capabilities that push forward the state-of-the-art in streaming. They will also share upcoming features and new capabilities in the integrations between Flink and Pulsar and how these two communities are working together to truly advance the power of stream processing.
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud (Hadoop / Spark Conference Japan 2019)
# English version #
http://hadoop.apache.jp/hcj2019-program/
Flink vs. Spark: this is the slide deck of my talk at the 2015 Flink Forward conference in Berlin, Germany, on October 12, 2015. In this talk, we tried to compare Apache Flink vs. Apache Spark with focus on real-time stream processing. Your feedback and comments are much appreciated.
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Flink Forward
Netflix’s playback data records every user interaction with video on the service, from trailers on the home page to full-length movies. This is a critical dataset with high volume that is used broadly across Netflix, powering product experiences, AB test metrics, and offline insights. In processing playback data, we depend heavily on event-time partitioning to handle a long tail of late arriving events. In this talk, I’ll provide an overview of our recent implementation of generic event-time partitioning on high volume streams using Apache Flink and Apache Iceberg (Incubating). Built as configurable Flink components that leverage Iceberg as a new output table format, we are now able to write playback data and other large scale datasets directly from a stream into a table partitioned on event time, replacing the common pattern of relying on a post-processing batch job that “puts the data in the right place”. We’ll talk through what it took to apply this to our playback data in practice, as well as challenges we hit along the way and tradeoffs with a streaming approach to event-time partitioning.
Best Practices for Middleware and Integration Architecture Modernization with...Claus Ibsen
What are important considerations when modernizing middleware and moving towards serverless and/or cloud native integration architectures? How can we make the most of flexible technologies such as Camel K, Kafka, Quarkus and OpenShift. Claus is working as project lead on Apache Camel and has extensive experience from open source product development.
The talk was recorded and runs for 30 minutes and published on youtube at: https://www.youtube.com/watch?v=d1Hr78a7Lww
Introduction to KSQL: Streaming SQL for Apache Kafka®confluent
Join Tom Green, Solution Engineer at Confluent for this Lunch and Learn talk covering KSQL. Confluent KSQL is the streaming SQL engine that enables real-time data processing against Apache Kafka®. It provides an easy-to-use, yet powerful interactive SQL interface for stream processing on Kafka, without the need to write code in a programming language such as Java or Python. KSQL is scalable, elastic, fault-tolerant, and it supports a wide range of streaming operations, including data filtering, transformations, aggregations, joins, windowing, and sessionization.
By attending one of these sessions, you will learn:
-How to query streams, using SQL, without writing code.
-How KSQL provides automated scalability and out-of-the-box high availability for streaming queries
-How KSQL can be used to join streams of data from different sources
-The differences between Streams and Tables in Apache Kafka
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
This presentation introduces Apache Flink, a massively parallel data processing engine which currently undergoes the incubation process at the Apache Software Foundation. Flink's programming primitives are presented and it is shown how easily a distributed PageRank algorithm can be implemented with Flink. Intriguing features such as dedicated memory management, Hadoop compatibility, streaming and automatic optimisation make it an unique system in the world of Big Data processing.
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
Flink Forward San Francisco 2022.
At Stripe we have created a complete end to end exactly-once processing pipeline to process financial data at scale, by combining the exactly-once power from Flink, Kafka, and Pinot together. The pipeline provides exactly-once guarantee, end-to-end latency within a minute, deduplication against hundreds of billions of keys, and sub-second query latency against the whole dataset with trillion level rows. In this session we will discuss the technical challenges of designing, optimizing, and operating the whole pipeline, including Flink, Kafka, and Pinot. We will also share our lessons learned and the benefits gained from exactly-once processing.
by
Xiang Zhang & Pratyush Sharma & Xiaoman Dong
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
Flink Forward San Francisco 2022.
Flink consumers read from Kafka as a scalable, high throughput, and low latency data source. However, there are challenges in scaling out data streams where migration and multiple Kafka clusters are required. Thus, we introduced a new Kafka source to read sharded data across multiple Kafka clusters in a way that conforms well with elastic, dynamic, and reliable infrastructure. In this presentation, we will present the source design and how the solution increases application availability while reducing maintenance toil. Furthermore, we will describe how we extended the existing KafkaSource to provide mechanisms to read logical streams located on multiple clusters, to dynamically adapt to infrastructure changes, and to perform transparent cluster migrations and failover.
by
Mason Chen
Apache Flink: Real-World Use Cases for Streaming AnalyticsSlim Baltagi
This face to face talk about Apache Flink in Sao Paulo, Brazil is the first event of its kind in Latin America! It explains how Apache Flink 1.0 announced on March 8th, 2016 by the Apache Software Foundation (link), marks a new era of Big Data analytics and in particular Real-Time streaming analytics. The talk maps Flink's capabilities to real-world use cases that span multiples verticals such as: Financial Services, Healthcare, Advertisement, Oil and Gas, Retail and Telecommunications.
In this talk, you learn more about:
1. What is Apache Flink Stack?
2. Batch vs. Streaming Analytics
3. Key Differentiators of Apache Flink for Streaming Analytics
4. Real-World Use Cases with Flink for Streaming Analytics
5. Who is using Flink?
6. Where do you go from here?
Flink Forward San Francisco 2022.
Resource Elasticity is a frequently requested feature in Apache Flink: Users want to be able to easily adjust their clusters to changing workloads for resource efficiency and cost saving reasons. In Flink 1.13, the initial implementation of Reactive Mode was introduced, later releases added more improvements to make the feature production ready. In this talk, we’ll explain scenarios to deploy Reactive Mode to various environments to achieve autoscaling and resource elasticity. We’ll discuss the constraints to consider when planning to use this feature, and also potential improvements from the Flink roadmap. For those interested in the internals of Flink, we’ll also briefly explain how the feature is implemented, and if time permits, conclude with a short demo.
by
Robert Metzger
An Approach to Data Quality for Netflix Personalization SystemsDatabricks
Personalization is one of the key pillars of Netflix as it enables each member to experience the vast collection of content tailored to their interests.
Flink SQL & TableAPI in Large Scale Production at AlibabaDataWorks Summit
The search and recommendation systems for Alibaba’s e-commerce platform use batch and stream processing heavily. Flink SQL and Table API (which is a SQL-like DSL) provide a simple, flexible, and powerful language to express the data processing logic. More importantly, they open the door to unifying the semantics of batch and streaming jobs.
Blink is a project at Alibaba which improves Apache Flink to make it ready for large scale production use. To support our products, we made lots of improvements to Flink SQL & TableAPI in Alibaba's Blink project. We added support for user-defined table functions (UDTF), user-defined aggregates (UDAGG), window aggregates, retraction, and more. We are actively working with the Flink community to contribute these improvements back. In this talk, we will present the rationale, semantics, design and implementation of these improvements. We will also share the experience of running large scale Flink SQL and TableAPI jobs at Alibaba.
Optimizing Delta/Parquet Data Lakes for Apache SparkDatabricks
This talk outlines data lake design patterns that can yield massive performance gains for all downstream consumers. We will talk about how to optimize Parquet data lakes and the awesome additional features provided by Databricks Delta.
* Optimal file sizes in a data lake
* File compaction to fix the small file problem
* Why Spark hates globbing S3 files
* Partitioning data lakes with partitionBy
* Parquet predicate pushdown filtering
* Limitations of Parquet data lakes (files aren't mutable!)
* Mutating Delta lakes
* Data skipping with Delta ZORDER indexes
Speaker: Matthew Powers
A 2-hour session where I cover what Apache Camel is and the latest news on the upcoming Camel v3. The main topic of the talk is the new Camel K sub-project for running integrations natively on the cloud with Kubernetes. The last part of the talk is about running Camel with GraalVM / Quarkus to achieve natively compiled binaries that have impressive startup time and footprint.
Flink powered stream processing platform at PinterestFlink Forward
Flink Forward San Francisco 2022.
Pinterest is a visual discovery engine that serves over 433MM users. Stream processing allows us to unlock value from realtime data for pinners. At Pinterest, we adopted Flink as the unified stream processing engine. In this talk, we will share our journey in building a stream processing platform with Flink and how we onboard critical use cases to the platform. Pinterest has supported 90+ near-realtime streaming applications. We will cover the problem statement, how we evaluated potential solutions, and our decision to build the framework.
by
Rainie Li & Kanchi Masalia
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
London Spark Meetup 2014-11-11 @Skimlinks
http://www.meetup.com/Spark-London/events/217362972/
To paraphrase the immortal crooner Don Ho: "Tiny Batches, in the wine, make me happy, make me feel fine." http://youtu.be/mlCiDEXuxxA
Apache Spark provides support for streaming use cases, such as real-time analytics on log files, by leveraging a model called discretized streams (D-Streams). These "micro batch" computations operate on small time intervals, generally from 500 milliseconds up. One major innovation of Spark Streaming is that it leverages a unified engine. In other words, the same business logic can be used across multiple use cases: streaming, but also interactive, iterative, machine learning, etc.
This talk will compare case studies for production deployments of Spark Streaming, emerging design patterns for integration with popular complementary OSS frameworks, plus some of the more advanced features such as approximation algorithms, and take a look at what's ahead — including the new Python support for Spark Streaming that will be in the upcoming 1.2 release.
Also, let's chat a bit about the new Databricks + O'Reilly developer certification for Apache Spark…
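To make the micro-batch idea concrete, here is a minimal sketch in plain Python, not Spark's API: it slices a timestamped stream into fixed 500-millisecond intervals and runs the same counting function on every slice. The names `micro_batches` and `word_count` and the sample events are hypothetical, chosen only for illustration, and the sketch skips intervals that contain no events.

```python
from collections import defaultdict


def micro_batches(events, interval_ms=500):
    """Group (timestamp_ms, value) events into fixed-width batches, oldest first."""
    batches = defaultdict(list)
    for timestamp_ms, value in events:
        # Integer division maps each timestamp to the index of its interval.
        batches[timestamp_ms // interval_ms].append(value)
    return [batches[index] for index in sorted(batches)]


def word_count(batch):
    """Ordinary batch logic, reused unchanged on every micro-batch."""
    counts = {}
    for word in batch:
        counts[word] = counts.get(word, 0) + 1
    return counts


# Hypothetical sample stream: (timestamp in ms, word)
events = [(0, "a"), (120, "b"), (510, "a"), (990, "a"), (1200, "b")]
results = [word_count(batch) for batch in micro_batches(events)]
print(results)
```

The point of the sketch is the reuse: `word_count` knows nothing about streaming, which is roughly what the abstract means by running the same business logic on a unified engine.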
Overview of Apache Flink: the 4 G of Big Data Analytics FrameworksSlim Baltagi
Slides of my talk at the Hadoop Summit Europe in Dublin, Ireland on April 13th, 2016. The talk introduces Apache Flink as both a multi-purpose Big Data analytics framework and real-world streaming analytics framework. It is focusing on Flink's key differentiators and suitability for streaming analytics use cases. It also shows how Flink enables novel use cases such as distributed CEP (Complex Event Processing) and querying the state by behaving like a key value data store.
Stream Processing with CompletableFuture and Flow in Java 9Trayan Iliev
Stream-based data / event / message processing is becoming the preferred way of achieving interoperability and real-time communication in distributed SOA / microservice / database architectures.
Besides lambdas, Java 8 introduced two new APIs explicitly dealing with stream data processing:
- Stream - which is PULL-based and easily parallelizable;
- CompletableFuture / CompletionStage - which allow composition of PUSH-based, non-blocking, asynchronous data processing pipelines.
Java 9 will provide further support for stream-based data-processing by extending the CompletableFuture with additional functionality – support for delays and timeouts, better support for subclassing, and new utility methods.
Moreover, Java 9 provides the new java.util.concurrent.Flow API, which implements the Reactive Streams specification and enables reactive programming and interoperability with libraries like Reactor, RxJava, RabbitMQ, Vert.x, Ratpack, and Akka.
The presentation will discuss the novelties in Java 8 and Java 9 supporting stream data processing, describing the APIs, models and practical details of asynchronous pipeline implementation, error handling, multithreaded execution, asynchronous REST service implementation, and interoperability with existing libraries.
Demo examples (code on GitHub) are provided using CompletableFuture and Flow with:
- JAX-RS 2.1 AsyncResponse, and more importantly unit-testing the async REST service method implementations;
- CDI 2.0 asynchronous observers (fireAsync / @ObservesAsync);
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkSlim Baltagi
These are the slides of my talk on June 30, 2015 at the first event of the Chicago Apache Flink meetup. Although most of the current buzz is about Apache Spark, the talk shows how Apache Flink offers the only hybrid open source (Real-Time Streaming + Batch) distributed data processing engine supporting many use cases: Real-Time stream processing, machine learning at scale, graph analytics and batch processing.
In these slides, you will find answers to the following questions: What is the Apache Flink stack and how does it fit into the Big Data ecosystem? How does Apache Flink integrate with Apache Hadoop and other open source tools for data input and output as well as deployment? What is the architecture of Apache Flink? What are the different execution modes of Apache Flink? Why is Apache Flink an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark? Who is using Apache Flink? Where can you learn more about Apache Flink?
Databricks Meetup @ Los Angeles Apache Spark User GroupPaco Nathan
Los Angeles Apache Spark Users Group 2014-12-11 http://meetup.com/Los-Angeles-Apache-Spark-Users-Group/events/218748643/
A look ahead at Spark Streaming in Spark 1.2 and beyond, with case studies, demos, plus an overview of approximation algorithms that are useful for real-time analytics.
Why Apache Flink is the 4G of Big Data Analytics FrameworksSlim Baltagi
Apache Flink is a community-driven open source and memory-centric Big Data analytics framework. It provides the only hybrid (Real-Time Streaming + Batch) open source distributed data processing engine supporting many use cases.
Flink uses a mixture of Scala and Java internally, has very good Scala APIs and some of its libraries are basically pure Scala (FlinkML and Table).
At its core, it is a streaming dataflow execution engine and it also provides several APIs for batch processing (DataSet API), real-time streaming (DataStream API) and relational queries (Table API) and also domain-specific libraries for machine learning (FlinkML) and graph processing (Gelly).
In this talk, you will learn in more detail about:
What is Apache Flink, how does it fit into the Big Data ecosystem, and why is it the 4G (4th Generation) of Big Data Analytics frameworks?
How does Apache Flink integrate with Apache Hadoop and other open source tools for data input and output as well as deployment?
Why is Apache Flink an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark? What are the benchmarking results between Apache Flink and those other Big Data analytics frameworks?
A noETL Parallel Streaming Transformation Loader using Spark, Kafka & VerticaData Con LA
ETL, ELT and Lambda architectures have evolved into a [non]Streaming general purpose data ingestion pipeline, that is scalable through distributed processing, for Big Data Analytics over hybrid Data Warehouses in Hadoop and MPP Columnar stores like HPE-Vertica.
Bio: Jack Gudenkauf (https://www.linkedin.com/in/jackglinkedin) has over twenty-nine years of experience designing and implementing Internet scale distributed systems. Jack is currently the CEO & Founder of the startup BigDataInfra. He was previously VP of Big Data at Playtika and a hands-on manager of the Twitter Analytics Data Warehouse team, and he spent 15 years at Microsoft shipping 15 products. Prior to Microsoft, he managed his own consulting company after beginning his career as an MIS Director of several startup companies.
This introductory level talk is about Apache Flink: a multi-purpose Big Data analytics framework leading a movement towards the unification of batch and stream processing in the open source.
With the many technical innovations it brings along with its unique vision and philosophy, it is considered the 4 G (4th Generation) of Big Data Analytics frameworks providing the only hybrid (Real-Time Streaming + Batch) open source distributed data processing engine supporting many use cases: batch, streaming, relational queries, machine learning and graph processing.
In this talk, you will learn about:
1. What is the Apache Flink stack and how does it fit into the Big Data ecosystem?
2. How does Apache Flink integrate with Hadoop and other open source tools for data input and output as well as deployment?
3. Why is Apache Flink an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark?
4. Who is using Apache Flink?
5. Where to learn more about Apache Flink?
Building and deploying LLM applications with Apache AirflowKaxil Naik
Behind the growing interest in Generative AI and LLM-based enterprise applications lies an expanded set of requirements for data integrations and ML orchestration. Enterprises want to use proprietary data to power LLM-based applications that create new business value, but they face challenges in moving beyond experimentation. The pipelines that power these models need to run reliably at scale, bringing together data from many sources and reacting continuously to changing conditions.
This talk focuses on the design patterns for using Apache Airflow to support LLM applications created using private enterprise data. We’ll go through a real-world example of what this looks like, as well as a proposal to improve Airflow and to add additional Airflow Providers to make it easier to interact with LLMs such as the ones from OpenAI (such as GPT4) and the ones on HuggingFace, while working with both structured and unstructured data.
In short, this shows how these Airflow patterns enable reliable, traceable, and scalable LLM applications within the enterprise.
https://airflowsummit.org/sessions/2023/keynote-llm/
First presentation for Savi's sponsorship of the Washington DC Spark Interactive. Discusses tips and lessons learned using Spark Streaming (24x7) to ingest and analyze Industrial Internet of Things (IIoT) data as part of a Lambda Architecture.
In the past, emerging technologies took years to mature. In the case of big data, while effective tools are still emerging, the analytics requirements are changing rapidly, forcing businesses to either keep up or be left behind.
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...confluent
Tinder’s Quickfire Pipeline powers all things data at Tinder. It was originally built using AWS Kinesis Firehoses and has since been extended to use both Kafka and other event buses. It is the core of Tinder’s data infrastructure. This rich data flow of both client and backend data has been extended to service a variety of needs at Tinder, including Experimentation, ML, CRM, and Observability, allowing backend developers easier access to shared client side data. We perform this using many systems, including Kafka, Spark, Flink, Kubernetes, and Prometheus. Many of Tinder’s systems were natively designed in an RPC first architecture.
Things we’ll discuss about decoupling your system at scale via event-driven architectures include:
– Powering ML, backend, observability, and analytical applications at scale, including an end-to-end walkthrough of our processes that allow non-programmers to write and deploy event-driven data flows.
– An end-to-end demonstration of dynamic event processing that creates other stream processes, via a dynamic control plane topology pattern and broadcast state pattern
– How to manage the unavailability of cached data that would normally come from repeated API calls for data that’s being backfilled into Kafka, all online! (and why this is not necessarily a “good” idea)
– Integrating common OSS frameworks and libraries like Kafka Streams, Flink, Spark and friends to encourage the best design patterns for developers coming from traditional service oriented architectures, including pitfalls and lessons learned along the way.
– Why and how to avoid overloading microservices with excessive RPC calls from event-driven streaming systems
– Best practices in common data flow patterns, such as shared state via RocksDB + Kafka Streams as well as the complementary tools in the Apache Ecosystem.
– The simplicity and power of streaming SQL with microservices
Unified Batch and Real-Time Stream Processing Using Apache FlinkSlim Baltagi
This talk was given at Capital One on September 15, 2015 at the launch of the Washington DC Area Apache Flink Meetup. Apache Flink is positioned at the forefront of 2 major trends in Big Data Analytics:
- Unification of Batch and Stream processing
- Multi-purpose Big Data Analytics frameworks
In these slides, we will also find answers to the burning question: Why Apache Flink? You will also learn more about how Apache Flink compares to Hadoop MapReduce, Apache Spark and Apache Storm.
Getting real-time analytics for device, application, and business monitoring from trillions of events and petabytes of data, as companies like Netflix, Uber, Alibaba, PayPal, eBay, and Metamarkets do.
Building a fully managed stream processing platform on Flink at scale for Lin...Flink Forward
Apache Flink is a distributed stream processing framework that allows users to process and analyze data in real-time. At LinkedIn, we developed a fully managed stream processing platform on Flink running on K8s to power hundreds of stream processing pipelines in production. This platform is the backbone for other infra systems like Search, Espresso (internal document store) and feature management etc. We provide a rich authoring and testing environment which allows users to create, test, and deploy their streaming jobs in a self-serve fashion within minutes. Users can focus on their business logic, leaving the Flink platform to take care of management aspects such as split deployment, resource provisioning, auto-scaling, job monitoring, alerting, failure recovery and much more. In this talk, we will introduce the overall platform architecture, highlight the unique value propositions that it brings to stream processing at LinkedIn and share the experiences and lessons we have learned.
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...Flink Forward
Flink Forward San Francisco 2022.
To improve Amazon Alexa experiences and support machine learning inference at scale, we built an automated end-to-end solution for incremental model building or fine-tuning machine learning models through continuous learning, continual learning, and/or semi-supervised active learning. Customer privacy is our top concern at Alexa, and as we build solutions, we face unique challenges when operating at scale such as supporting multiple applications with tens of thousands of transactions per second with several dependencies including near-real time inference endpoints at low latencies. Apache Flink helps us transform and discover metrics in near-real time in our solution. In this talk, we will cover the challenges that we faced, how we scale the infrastructure to meet the needs of ML teams across Alexa, and go into how we enable specific use cases that use Apache Flink on Amazon Kinesis Data Analytics to improve Alexa experiences to delight our customers while preserving their privacy.
by
Aansh Shah
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
Flink Forward San Francisco 2022.
Probably everyone who has written stateful Apache Flink applications has used one of the fault-tolerant keyed state primitives ValueState, ListState, and MapState. With RocksDB, however, retrieving and updating items comes at an increased cost that you should be aware of. Sometimes, these costs may not be avoidable with the current API, e.g., for efficient event-time stream-sorting or streaming joins where you need to iterate one or two buffered streams in the right order. With FLIP-220, we are introducing a new state primitive: BinarySortedMultiMapState. This new form of state lets you (a) efficiently store lists of values for a user-provided key, and (b) iterate keyed state in a well-defined sort order. Both features can be backed efficiently by RocksDB with a 2x performance improvement over the current workarounds. This talk will go into the details of the new API and its implementation, present how to use it in your application, and talk about the process of getting it into Flink.
by
Nico Kruber
Introducing the Apache Flink Kubernetes OperatorFlink Forward
Flink Forward San Francisco 2022.
The Apache Flink Kubernetes Operator provides a consistent approach to manage Flink applications automatically, without any human interaction, by extending the Kubernetes API. Given the increasing adoption of Kubernetes based Flink deployments the community has been working on a Kubernetes native solution as part of Flink that can benefit from the rich experience of community members and ultimately make Flink easier to adopt. In this talk we give a technical introduction to the Flink Kubernetes Operator and demonstrate the core features and use-cases through in-depth examples.
by
Thomas Weise
One sink to rule them all: Introducing the new Async SinkFlink Forward
Flink Forward San Francisco 2022.
Next time you want to integrate with a new destination for a demo, concept or production application, the Async Sink framework will bootstrap development, allowing you to move quickly without compromise. In Flink 1.15 we introduced the Async Sink base (FLIP-171), with the goal to encapsulate common logic and allow developers to focus on the key integration code. The new framework handles things like request batching, buffering records, applying backpressure, retry strategies, and at least once semantics. It allows you to focus on your business logic, rather than spending time integrating with your downstream consumers. During the session we will dive deep into the internals to uncover how it works, why it was designed this way, and how to use it. We will code up a new sink from scratch and demonstrate how to quickly push data to a destination. At the end of this talk you will be ready to start implementing your own Flink sink using the new Async Sink framework.
by
Steffen Hausmann & Danny Cranmer
Tuning Apache Kafka Connectors for FlinkFlink Forward
Flink Forward San Francisco 2022.
In normal situations, the default Kafka consumer and producer configuration options work well. But we all know life is not all roses and rainbows and in this session we’ll explore a few knobs that can save the day in atypical scenarios. First, we'll take a detailed look at the parameters available when reading from Kafka. We’ll inspect the params that help us quickly spot an application lock or crash, the ones that can significantly improve performance, and the ones to touch with gloves since they could cause more harm than benefit. Moreover, we’ll explore the partitioning options and discuss when diverging from the default strategy is needed. Next, we’ll discuss the Kafka Sink. After browsing the available options we'll then dive deep into understanding how to approach use cases like sinking enormous records, managing spikes, and handling small but frequent updates. If you want to understand how to make your application survive when the sky is dark, this session is for you!
by
Olena Babenko
Flink Forward San Francisco 2022.
This talk will take you on the long journey of Apache Flink into the cloud-native era. The journey started back when Hadoop and YARN were the standard way of deploying and operating data applications.
We're going to deep dive into the cloud-native set of principles and how they map to the Apache Flink internals and recent improvements. We'll cover fast checkpointing, fault tolerance, resource elasticity, minimal infrastructure dependencies, industry-standard tooling, ease of deployment and declarative APIs.
After this talk you'll get a broader understanding of the operational requirements for a modern streaming application and where the current limits are.
by
David Moravek
Where is my bottleneck? Performance troubleshooting in FlinkFlink Forward
Flink Forward San Francisco 2022.
In this talk, we will cover various topics around performance issues that can arise when running a Flink job and how to troubleshoot them. We’ll start with the basics, like understanding what the job is doing and what backpressure is. Next, we will see how to identify bottlenecks and which tools or metrics can be helpful in the process. Finally, we will also discuss potential performance issues during the checkpointing or recovery process, as well as some tips and Flink features that can speed up checkpointing and recovery times.
by
Piotr Nowojski
Using the New Apache Flink Kubernetes Operator in a Production DeploymentFlink Forward
Flink Forward San Francisco 2022.
Running natively on Kubernetes, using the new Apache Flink Kubernetes Operator is a great way to deploy and manage Flink application and session deployments. In this presentation, we provide:
- A brief overview of Kubernetes operators and their benefits.
- Introduce the five levels of the operator maturity model.
- Introduce the newly released Apache Flink Kubernetes Operator and FlinkDeployment CRs
- Dockerfile modifications you can make to swap out UBI images and Java of the underlying Flink Operator container
- Enhancements we're making in:
  - Versioning/Upgradeability/Stability
  - Security
- Demo of the Apache Flink Operator in-action, with a technical preview of an upcoming product using the Flink Kubernetes Operator.
- Lessons learned
- Q&A
by
James Busche & Ted Chang
Flink Forward San Francisco 2022.
The Table API is one of the most actively developed components of Flink in recent times. Inspired by databases and SQL, it encapsulates concepts many developers are familiar with. It can be used with both bounded and unbounded streams in a unified way. But from afar it can be difficult to keep track of what this API is capable of and how it relates to Flink's other APIs. In this talk, we will explore the current state of the Table API. We will show how it can be used as a batch processor, a changelog processor, or a streaming ETL tool with many built-in functions and operators for deduplicating, joining, and aggregating data. By comparing it to the DataStream API we will highlight differences and elaborate on when to use which API. We will demonstrate hybrid pipelines in which both APIs interact with one another and contribute their unique strengths. Finally, we will take a look at some of the most recent additions as a first step to stateful upgrades.
by
David Anderson
Flink Forward San Francisco 2022.
Based on the new Flink-Pulsar connector, we implemented Flink's TableAPI and Catalog to help users interact with the Pulsar cluster easily via Flink SQL. We would like to go through the design and implementation of the SQL connector in the following aspects:
1. Two different modes of using Pulsar as a metadata store
2. Data format transformation and management
3. SQL semantics support within Pulsar context
by
Sijie Guo & Neng Lu
Dynamic Rule-based Real-time Market Data AlertsFlink Forward
Flink Forward San Francisco 2022.
At Bloomberg, we deal with high volumes of real-time market data. Our clients expect to be notified of any anomalies in this market data, which may indicate volatile movements in the markets, notable trades, forthcoming events, or system failures. The parameters for these alerts are always evolving and our clients can update them dynamically. In this talk, we'll cover how we utilized the open source Apache Flink and Siddhi SQL projects to build a distributed, scalable, low-latency and dynamic rule-based, real-time alerting system to solve our clients' needs. We'll also cover the lessons we learned along our journey.
by
Ajay Vyasapeetam & Madhuri Jain
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
Flink Forward San Francisco 2022.
In modern data platform architectures, stream processing engines such as Apache Flink are used to ingest continuous streams of data into data lakes such as Apache Iceberg. Streaming ingestion to Iceberg tables can suffer from two problems: (1) the small files problem, which can hurt read performance, and (2) poor data clustering, which can make file pruning less effective. To address those two problems, we propose adding a shuffling stage to the Flink Iceberg streaming writer. The shuffling stage can intelligently group data via bin packing or range partition. This can reduce the number of concurrent files that every task writes. It can also improve data clustering. In this talk, we will explain the motivations in detail and dive into the design of the shuffling stage. We will also share the evaluation results that demonstrate the effectiveness of smart shuffling.
by
Gang Ye & Steven Wu
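As a rough illustration of how bin packing can cut the number of files each writer task has open, the sketch below assigns every partition key to exactly one task, placing the heaviest key first on the currently lightest task. This is a generic greedy heuristic in plain Python, not the design presented in the talk; the function name `assign_partitions`, the date-like keys, and the weights are hypothetical.

```python
import heapq


def assign_partitions(weights, num_tasks):
    """Greedy bin packing: heaviest partition first, onto the lightest task so far."""
    # Min-heap of (accumulated load, task index).
    heap = [(0, task) for task in range(num_tasks)]
    heapq.heapify(heap)
    assignment = {}
    # Sort by descending weight; break ties by key so the result is deterministic.
    for key, weight in sorted(weights.items(), key=lambda kv: (-kv[1], kv[0])):
        load, task = heapq.heappop(heap)
        assignment[key] = task
        heapq.heappush(heap, (load + weight, task))
    return assignment


# Hypothetical traffic estimate per partition key (arbitrary units).
weights = {"2022-08-01": 50, "2022-08-02": 30, "2022-08-03": 20, "2022-08-04": 10}
assignment = assign_partitions(weights, num_tasks=2)
print(assignment)
```

Because each key is routed to a single task, a task only writes files for the keys it owns instead of one file per key it happens to receive, while the greedy placement keeps the per-task load roughly even.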
Batch Processing at Scale with Flink & IcebergFlink Forward
Flink Forward San Francisco 2022.
Goldman Sachs's Data Lake platform serves as the firm's centralized data platform, ingesting 140K (and growing!) batches per day of Datasets of varying shape and size. Powered by Flink and using metadata configured by platform users, ingestion applications are generated dynamically at runtime to extract, transform, and load data into centralized storage where it is then exported to warehousing solutions such as Sybase IQ, Snowflake, and Amazon Redshift. Data Latency is one of many key considerations as producers and consumers have their own commitments to satisfy. Consumers range from people/systems issuing queries, to applications using engines like Spark, Hive, and Presto to transform data into refined Datasets. Apache Iceberg allows our applications to not only benefit from consistency guarantees important when running on eventually consistent storage like S3, but also allows us the opportunity to improve our batch processing patterns with its scalability-focused features.
by
Andreas Hailu
Flink Forward San Francisco 2022.
At Flink Forward, we get to hear creative, unique use cases, often on the bleeding edge of some of the most exciting current technologies. This talk will give you a chance to look under the hood of our driven and innovative Open Source community. I will cover what our community has been working on this past year, and how this work relates to our (Ververica's) exciting new Flink engineering roadmap! I will also go through some best practices and upcoming opportunities for getting involved in this community!
by
Caito Scherr
Practical learnings from running thousands of Flink jobsFlink Forward
Flink Forward San Francisco 2022.
Task Managers constantly running out of memory? Flink job keeps restarting from cryptic Akka exceptions? Flink job running but doesn’t seem to be processing any records? We share practical learnings from running thousands of Flink Jobs for different use-cases and take a look at common challenges they have experienced such as out-of-memory errors, timeouts and job stability. We will cover memory tuning, S3 and Akka configurations to address common pitfalls and the approaches that we take on automating health monitoring and management of Flink jobs at scale.
by
Hong Teoh & Usamah Jassat
Extending Flink SQL for stream processing use casesFlink Forward
Flink Forward San Francisco 2022.
Apache Flink is a powerful stream processing platform that enables users to build complex real time applications. Flink SQL provides a SQL interface that implements standard SQL. While standard SQL provides a perfect interface for batch processing, in a stream processing context it can result in ambiguity and complex syntax. As an example, consider these three types of streams: Append-only stream, Retract stream and Upsert stream. Using standard SQL, we would represent all of these streams as a Table, the same Table concept used in batch processing. Such overloading of concepts can result in ambiguity in SQL statements in a streaming context. In this talk, we will present extensions to Flink SQL that simplify SQL statements in the context of stream processing. We will show how such extensions work in the context of a Flink application using different use cases. These extensions are only syntactic sugar and users should be able to use Flink SQL as is if they desire.
by
Hojjat Jafarpour
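The difference between the three stream types mentioned above is easier to see when each one is applied to a table. The following is a minimal sketch in plain Python rather than Flink SQL; the event encodings ("+"/"-" flags for retractions, `None` as a delete marker for upserts) and the function names are assumptions made for illustration only.

```python
def apply_append(rows, events):
    """Append-only stream: every event is a new row; nothing is ever changed."""
    return list(rows) + list(events)


def apply_retract(rows, events):
    """Retract stream: '+' adds a row, '-' withdraws a previously emitted row."""
    rows = list(rows)
    for op, row in events:
        if op == "+":
            rows.append(row)
        else:
            rows.remove(row)
    return rows


def apply_upsert(table, events):
    """Upsert stream: the latest value per key wins; None deletes the key."""
    table = dict(table)
    for key, value in events:
        if value is None:
            table.pop(key, None)
        else:
            table[key] = value
    return table


# A per-user count that changes from 1 to 2, expressed in each encoding.
appended = apply_append([], [("alice", 1), ("alice", 2)])
retracted = apply_retract([], [("+", ("alice", 1)), ("-", ("alice", 1)), ("+", ("alice", 2))])
upserted = apply_upsert({}, [("alice", 1), ("alice", 2)])
print(appended, retracted, upserted)
```

The same logical update yields two rows when read as append-only data but a single current row under the retract and upsert encodings, which is the kind of ambiguity that arises when all three are presented as one Table concept.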
The top 3 challenges running multi-tenant Flink at scaleFlink Forward
Apache Flink is the foundation for Decodable's real-time SaaS data platform. Flink runs critical data processing jobs with strong security requirements. In addition, Decodable has to scale to thousands of tenants, power various use cases, provide an intuitive user experience and maintain cost-efficiency. We've learned a lot of lessons while building and maintaining the platform. In this talk, I'll share the top 3 toughest challenges building and operating this platform with Flink, and how we solved them.
Using Queryable State for Fun and ProfitFlink Forward
Flink Forward San Francisco 2022.
A particular feature in our system relies on a streaming 90-minute trailing window of 1-minute samples - implemented as a lookaside cache - to speed up a particular query, allowing our customers to rapidly see an overview of their estate. Across our entire customer base, there is a substantial amount of data flowing into this cache - ~1,000,000 entries/second, with the entire cache requiring ~600GB of RAM. The current implementation is simplistic but expensive. In this talk I describe a replacement implementation as a stateful streaming Flink application leveraging Queryable State. This Flink application reduces the net cost by ~90%. In this session, the implementation is described in detail, including windowing considerations, a sliding-window state buffer that avoids the sliding window replication penalty, and a comparison of queryable state and Redis queries. The talk concludes with a frank discussion of when this distinctive approach is, and is not, appropriate.
by
Ron Crocker
Changelog Stream Processing with Apache FlinkFlink Forward
Flink Forward San Francisco 2022.
The world is constantly changing. Data is continuously produced and thus should be consumed in a similar fashion by enterprise systems. Only this enables real-time decisions at scale. Message logs such as Apache Kafka can be found in almost every architecture, while databases and other batch systems still provide the foundation. Change Data Capture (CDC) propagates changes downstream. In this talk, we will highlight what it means to be a general data processor and how Flink can act as an integration hub. We present the current state of Flink and how it can power various use cases on both finite and infinite streams. We demonstrate Flink's SQL engine as a changelog processor that is shipped with an ecosystem tailored to process CDC data and maintain materialized views. We will use Kafka as an upsert log, Debezium for connecting to databases, and enrich streams of various sources. Finally, we will combine Flink's Table API with DataStream API for event-driven applications beyond SQL.
by
Timo Walther
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
6. 6
2. Confusing statements
“Spark is already an excellent piece of software and is
advancing very quickly. No vendor — no new project —
is likely to catch up. Chasing Spark would be a waste of
time, and would delay availability of real-time analytic
and processing services for no good reason.” Source:
MapReduce and Spark, Mike Olson, Chief Strategy
Officer, Cloudera. December 30, 2013.
http://vision.cloudera.com/mapreduce-spark/
“Goal: one engine for all data sources, workloads and
environments.” Source: Slide 15 of ‘New Directions for
Apache Spark in 2015’, Matei Zaharia, CTO, Databricks.
February 20, 2015.
http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
7. 7
3. Burning questions & incorrect or
outdated answers
"Projects that depend on smart optimizers rarely work
well in real life.” Curt Monash, Monash Research.
January 16, 2015.
http://www.computerworld.com/article/2871760/big-data-digest-how-many-hadoops-do-we-really-need.html
“Flink is basically a Spark alternative out of Germany,
which I’ve been dismissing as unneeded”. Curt
Monash, Monash Research, March 5, 2015.
http://www.dbms2.com/2015/03/05/cask-and-cdap/
“Of course, this is all a bullish argument for Spark (or
Flink, if I’m wrong to dismiss its chances as a Spark
competitor).” Curt Monash, Monash Research,
September 28, 2015.
http://www.dbms2.com/2015/09/28/the-potential-significance-of-cloudera-kudu/
8. 8
3. Burning questions & incorrect or
outdated answers
“The benefit of Spark's micro-batch model is that you
get full fault-tolerance and "exactly-once" processing
for the entire computation, meaning it can recover all
state and results even if a node crashes. Flink and
Storm don't provide this…” Matei Zaharia, CTO,
Databricks. May 2015.
http://www.kdnuggets.com/2015/05/interview-matei-zaharia-creator-apache-spark.html
“I understand Spark Streaming uses micro-batching.
Does this increase latency? While Spark does use a
micro-batch execution model, this does not have
much impact on applications…” http://spark.apache.org/faq.html
9. 9
4. Help others evaluating Flink vs. Spark
Beyond the marketing fluff, the confusing statements,
and the incorrect or outdated answers to burning
questions, what little information exists on Flink
vs. Spark is available only piecemeal!
While evaluating different stream processing tools at
Capital One, we built a framework listing categories
and over 100 criteria to assess these stream
processing tools.
In the next section, I’ll share this framework and
use it to compare Spark and Flink on a few key
criteria.
We hope this will be beneficial to you as well when
selecting Flink and/or Spark for stream processing.
10. 10
Agenda
I. Motivation for this talk
II. Apache Flink vs. Apache Spark?
III. How Flink is used at Capital One?
IV. What are some key takeaways?
11. 11
II. Apache Flink vs. Apache Spark?
1. What is Apache Flink?
2. What is Apache Spark?
3. Framework to evaluate Flink and Spark
4. Flink vs. Spark on a few key criteria
5. Future work
12. 12
1. What is Apache Flink?
Squirrel: Animal. In harmony with other animals in the
Hadoop ecosystem (Zoo): elephant, pig, python,
camel,...
Squirrel: reflects the meaning of the word ‘Flink’:
German for “nimble, swift, speedy” which are also
characteristics of the squirrel.
Red color: in harmony with red squirrels in Germany, to
reflect its roots at German universities.
Tail: colors matching the ones of the feather
symbolizing the Apache Software Foundation.
Commitment to build Flink in the open source!?
13. 13
1. What is Apache Flink?
“Apache Flink is an open source platform for
distributed stream and batch data processing.”
https://flink.apache.org/
See also the definition in Wikipedia:
https://en.wikipedia.org/wiki/Apache_Flink
15. 15
2. What is Apache Spark?
“Apache Spark™ is a fast and general engine for
large-scale data processing.”
http://spark.apache.org/
See also definition in Wikipedia:
https://en.wikipedia.org/wiki/Apache_Spark
Logo was picked to reflect Lightning-fast cluster
computing
19. 19
2. Fit-for-purpose Categories
2.1 Security
2.2 Provisioning & Monitoring Capabilities
2.3 Latency & Processing Architecture
2.4 State Management
2.5 Processing Delivery Assurance
2.6 Database Integrations, Native vs. Third
party connector
2.7 High Availability & Resiliency
2.8 Ease of Development
2.9 Scalability
2.10 Unique Capabilities/Key Differentiators
20. 20
2. Fit-for-purpose Categories
2.1 Security
2.1.1 Authentication, Authorization
2.1.2 Data at rest encryption (data persisted in the
framework)
2.1.3 Data in motion encryption (producer ->
framework -> consumer)
2.1.4 Data in motion encryption (inter-node
communication)
21. 21
2. Fit-for-purpose Categories
2.2 Provisioning & Monitoring Capabilities
2.2.1 Robustness of Administration
2.2.2 Ease of maintenance: Does technology provide
configuration, deployment, scaling, monitoring,
performance tuning and auditing capabilities?
2.2.3 Monitoring & Alerting
2.2.4 Logging
2.2.5 Audit
2.2.6 Transparent Upgrade: Version upgrade with
minimum downtime
22. 22
2. Fit-for-purpose Categories
2.3 Latency & Processing Architecture
2.3.1 Supports tuple at a time, micro-batch,
transactional updates and batch processing
2.3.2 Computational model
2.3.3 Ability to reprocess historical data from source
2.3.4 Ability to reprocess historical data from native
engine
2.3.5 Call external source (API/database calls)
2.3.6 Integration with Batch (static) source
2.3.7 Data Types (images, sound etc.)
2.3.8 Supports complex event processing and pattern
detection vs. continuous operator model (low latency,
flow control)
23. 23
2. Fit-for-purpose Categories
2.3 Latency & Processing Architecture
2.3.9 Handles stream imperfections (delayed)
2.3.10 Handles stream imperfections (out-of-order)
2.3.11 Handles stream imperfections (duplicate)
2.3.12 Handles seconds, sub-second or millisecond
event processing (Latency)
2.3.13 Compression
2.3.14 Support for batch analytics
2.3.15 Support for iterative analytics (machine learning,
graph analytics)
2.3.16 Data lineage provenance (origin of the owner)
2.3.17 Data lineage (accelerate recovery time)
24. 24
2. Fit-for-purpose Categories
2.4 State Management
2.4.1 Stateful vs. Stateless
2.4.2 Is stateful data Persisted locally vs.
external database vs. Ephemeral
2.4.3 Native rolling, tumbling and hopping
window support
2.4.4 Native support for integrated data store
25. 25
2. Fit-for-purpose Categories
2.5 Processing Delivery Assurance
2.5.1 Guarantee (At least once)
2.5.2 Guarantee (At most once)
2.5.3 Guarantee (Exactly once)
2.5.4 Global Event order guaranteed
2.5.5 Guarantee predictable and repeatable
outcomes (deterministic or not)
26. 26
2. Fit-for-purpose Categories
2.6 Database Integrations, Native vs. Third
party connector
2.6.1 NoSQL database integration
2.6.2 File Format (Avro, Parquet and other
format support)
2.6.3 RDBMS integration
2.6.4 In-memory database integration/ Caching
integration
27. 27
2. Fit-for-purpose Categories
2.7 High Availability & Resiliency
2.7.1 Can the system avoid slowdown due to straggler node
2.7.2 Fault-Tolerance (does the tool handle
node/operator/messaging failures without catastrophically failing)
2.7.3 State recovery from in-memory
2.7.4 State recovery from reliable storage
2.7.5 Overhead of fault tolerance mechanism (Does failure
handling introduce additional latency or negatively impact
throughput?)
2.7.6 Multi-site support (multi-region)
2.7.7 Flow control: backpressure tolerance from slow operators or
consumers
2.7.8 Fast parallel recovery vs. replication or serial recovery on
one node at a time
28. 28
2. Fit-for-purpose Categories
2.8 Ease of Development
2.8.1 SQL Interface
2.8.2 Real-Time debugging option
2.8.3 Built-in stream-oriented abstractions (streams, windows, operators, iterators:
expressive APIs that enable programmers to quickly develop streaming data
applications)
2.8.4 Separation of application logic from fault tolerance
2.8.5 Testing tools and framework
2.8.6 Change management: multiple model deployment (e.g. a separate cluster, or
can one create multiple independent redundant streams internally?)
2.8.7 Dynamic model swapping (Support dynamic updating of
operators/topology/DAG without restart or service interruption)
2.8.8 Required knowledge of system internals to develop an application
2.8.9 Time to market for applications
2.8.10 Supports plug-in of external libraries
2.8.11 API High Level/Low Level
2.8.12 Ease of configuration
2.8.13 GUI based abstraction layer
29. 29
2. Fit-for-purpose Categories
2.9 Scalability
2.9.1 Supports multi-thread across multiple
processors/cores
2.9.2 Distributed across multiple machines/servers
2.9.3 Partition Algorithm
2.9.4 Dynamic elasticity - Scaling with minimum impact/
performance penalty
2.9.5 Horizontal scaling with linear
performance/throughput
2.9.6 Vertical scaling (GPU)
2.9.7 Scaling without downtime
30. 30
3. Organizational-fit Categories
3.1 Maturity & Community Support
3.2 Support Languages for Development
3.3 Cloud Portability
3.4 Compatibility with Native Hadoop
Architecture
3.5 Adoption of Community vs. Enterprise
Edition
3.6 Integration with Message Brokers
31. 31
3. Organizational-fit Categories
3.1 Maturity & Community Support
3.1.1 Open Source Support
3.1.2 Maturity (years)
3.1.3 Stable
3.1.4 Centralized documentation with versioning
support
3.1.5 Documentation of programming API with good
code examples
3.1.6 Centralized visible roadmap
3.1.7 Community acceptance vs. Vendor driven
3.1.8 Contributors
32. 32
3. Organizational-fit Categories
3.2 Support Languages for Development
3.2.1 Language technology was built on
3.2.2 Language supported to access
technology
33. 33
3. Organizational-fit Categories
3.3 Cloud Portability
3.3.1 Ease of migration between cloud vendors
3.3.2 Ease of migration from on premise to cloud
3.3.3 Ease of migration from on premise to complete
cloud services
3.3.4 Cloud compatibility (AWS, Google, Azure)
34. 34
3. Organizational-fit Categories
3.4 Compatibility with Native Hadoop
Architecture
3.4.1 Implement on top of Hadoop YARN vs.
Standalone
3.4.2 Mesos
3.4.3 Coordination with Apache Zookeeper
37. 37
4. Flink vs. Spark on a few key criteria
1. Streaming Engine
2. Iterative Processing
3. Memory Management
4. Optimization
5. Configuration
6. Tuning
7. Performance
38. 38
4.1. Streaming Engine
Many time-critical applications need to process large
streams of live data and provide results in real-time.
For example:
Financial Fraud detection
Financial Stock monitoring
Anomaly detection
Traffic management applications
Patient monitoring
Online recommenders
Some claim that 95% of streaming use cases can
be handled with micro-batches!? Really!!!
39. 39
4.1. Streaming Engine
Spark’s micro-batching isn’t good enough!
Ted Dunning, Chief Applications Architect at MapR,
talk at the Bay Area Apache Flink Meetup on August
27, 2015.
http://www.meetup.com/Bay-Area-Apache-Flink-Meetup/events/224189524/
Ted described several use cases where batch and
micro-batch processing are not appropriate, and
explained why.
He also described what a true streaming solution needs
to provide for solving these problems.
These use cases were taken from real industrial
situations, but the descriptions drove down to technical
details as well.
40. 40
4.1. Streaming Engine
“I would consider stream data analysis to be a major
unique selling proposition for Flink. Due to its
pipelined architecture, Flink is a perfect match for big
data stream processing in the Apache stack.” – Volker
Markl
Ref.: On Apache Flink. Interview with Volker Markl, June 24th 2015
http://www.odbms.org/blog/2015/06/on-apache-flink-interview-with-volker-markl/
Apache Flink uses streams for all workloads:
streaming, SQL, micro-batch and batch. Batch is just
treated as a finite set of streamed data. This makes
Flink the most sophisticated distributed open source
Big Data processing engine (not the most mature one
yet!).
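The idea that batch is just a finite stream can be sketched in plain Java, without any Flink dependency. This is only an illustration under stated assumptions: the class `BoundedStreamSketch`, the method `countWords` and the `maxEvents` cut-off are hypothetical names introduced here, not part of Flink's API. The same record-at-a-time operator is fed once from a finite source and once from an unbounded one.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Illustrative only: one record-at-a-time operator for bounded and unbounded input. */
final class BoundedStreamSketch {

    /**
     * Consumes records one at a time and keeps a running count per word.
     * It stops when the source ends (the "batch" case) or after maxEvents
     * records (a hypothetical cut-off so an unbounded source can be observed).
     */
    static Map<String, Integer> countWords(Iterator<String> source, long maxEvents) {
        Map<String, Integer> counts = new HashMap<>();
        for (long seen = 0; seen < maxEvents && source.hasNext(); seen++) {
            counts.merge(source.next(), 1, Integer::sum);
        }
        return counts;
    }
}
```

A finite list passed as `List.of("a", "b", "a").iterator()` is drained to its end, while `Stream.generate(() -> "tick").iterator()` never ends and is simply observed for as long as the caller chooses. The operator itself does not change.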
42. 42
4.2. Iterative Processing
Flink's API offers two dedicated iteration operations:
Iterate and Delta Iterate.
Flink executes programs with iterations as cyclic
data flows: a data flow program (and all its operators)
is scheduled just once.
In each iteration, the step function consumes the
entire input (the result of the previous iteration, or the
initial data set), and computes the next version of the
partial solution
43. 43
4.2. Iterative Processing
Delta iterations run only on the parts of the data that
are changing, and can significantly speed up many
machine learning and graph algorithms because the
work in each iteration decreases as the number of
iterations goes on.
Documentation on iterations with Apache Flink
http://ci.apache.org/projects/flink/flink-docs-master/apis/iterations.html
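The difference between a bulk iteration and a delta iteration can be illustrated with a small plain-Java sketch. It does not use Flink's Iterate or Delta Iterate operators; the class `IterationSketch` and its methods are hypothetical, and the algorithm (connected components by minimum-label propagation) is chosen here only as a familiar example. The bulk variant reprocesses every vertex in each step, while the delta variant keeps a workset of vertices whose label changed in the previous step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only: bulk vs. delta iteration for connected components. */
final class IterationSketch {

    /** Final labels plus the number of vertices processed in each step. */
    static final class Result {
        final int[] labels;
        final List<Integer> workPerStep;

        Result(int[] labels, List<Integer> workPerStep) {
            this.labels = labels;
            this.workPerStep = workPerStep;
        }
    }

    /** Bulk iteration: each step consumes the entire previous partial solution. */
    static Result bulk(int n, int[][] edges) {
        int[] labels = new int[n];
        for (int v = 0; v < n; v++) labels[v] = v;
        List<Integer> work = new ArrayList<>();
        boolean changed = true;
        while (changed) {
            changed = false;
            work.add(n); // every vertex is part of the step's input
            int[] next = labels.clone();
            for (int[] e : edges) {
                int min = Math.min(labels[e[0]], labels[e[1]]);
                if (min < next[e[0]]) { next[e[0]] = min; changed = true; }
                if (min < next[e[1]]) { next[e[1]] = min; changed = true; }
            }
            labels = next;
        }
        return new Result(labels, work);
    }

    /** Delta iteration: each step only processes vertices that changed last time. */
    static Result delta(int n, int[][] edges) {
        List<List<Integer>> neighbors = new ArrayList<>();
        int[] labels = new int[n];
        Set<Integer> workset = new TreeSet<>();
        for (int v = 0; v < n; v++) {
            neighbors.add(new ArrayList<>());
            labels[v] = v;
            workset.add(v); // initially every vertex is "changed"
        }
        for (int[] e : edges) {
            neighbors.get(e[0]).add(e[1]);
            neighbors.get(e[1]).add(e[0]);
        }
        List<Integer> work = new ArrayList<>();
        while (!workset.isEmpty()) {
            work.add(workset.size());
            int[] next = labels.clone(); // copy kept for clarity, not efficiency
            Set<Integer> nextWorkset = new TreeSet<>();
            for (int v : workset) {
                for (int u : neighbors.get(v)) {
                    if (labels[v] < next[u]) {
                        next[u] = labels[v];
                        nextWorkset.add(u);
                    }
                }
            }
            labels = next;
            workset = nextWorkset;
        }
        return new Result(labels, work);
    }
}
```

On a chain of four vertices plus one isolated vertex, both variants end with the same labels, but the delta variant processes 5, 3, 2 and then 1 vertices, whereas the bulk variant processes all 5 in each of its four steps.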
44. 44
4.2. Iterative Processing
[Figure: a Client driving a sequence of Step operations]
for (int i = 0; i < maxIterations; i++) {
    // Execute MapReduce job
}
Non-native iterations in Hadoop and Spark are
implemented as regular for-loops outside the system.
45. 45
4.2. Iterative Processing
Although Spark caches data across iterations, it still
needs to schedule and execute a new set of tasks for
each iteration.
Spinning Fast Iterative Data Flows, Ewen et al., 2012.
Academic paper on the Apache Flink model for
incremental iterative dataflow processing.
http://vldb.org/pvldb/vol5/p1268_stephanewen_vldb2012.pdf
Recap of the paper, June 18, 2015.
http://blog.acolyer.org/2015/06/18/spinning-fast-iterative-dataflows/
Documentation on iterations with Apache Flink.
http://ci.apache.org/projects/flink/flink-docs-master/apis/iterations.html
46. 46
4.3. Memory Management
Question: Spark vs. Flink low memory available?
Question answered on stackoverflow.com:
http://stackoverflow.com/questions/31935299/spark-vs-flink-low-memory-available
The same question is still unanswered on the Apache
Spark Mailing List!!
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/spark-vs-flink-low-memory-available-td2364.html
47. 47
4.3. Memory Management
Features:
C++ style memory management inside the JVM
User data stored in serialized byte arrays in JVM
Memory is allocated, de-allocated, and used strictly
using an internal buffer pool implementation.
Advantages:
1. Flink will not throw an OOM exception on you.
2. Reduction of Garbage Collection (GC)
3. Very efficient disk spilling and network transfers
4. No need for runtime tuning
5. More reliable and stable performance
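The features and advantages above can be made concrete with a minimal plain-Java sketch. It is not Flink's actual memory manager: `PagePoolSketch`, `PagePool`, `write` and `sumCounts` are hypothetical names, and the record layout (length-prefixed UTF-8 word followed by an int count) is an assumption made for illustration. The point it shows is that a fixed budget of pages is allocated once, records live in them as bytes rather than objects, and running out of pages is an ordinary return value that a caller could react to (for example by spilling) rather than an OutOfMemoryError.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative only: a fixed pool of byte pages holding serialized records. */
final class PagePoolSketch {

    /** Fixed budget of equally sized pages, allocated once up front. */
    static final class PagePool {
        private final Deque<ByteBuffer> free = new ArrayDeque<>();

        PagePool(int pages, int pageSize) {
            for (int i = 0; i < pages; i++) {
                free.push(ByteBuffer.allocate(pageSize));
            }
        }

        /** Returns an empty page, or null when the budget is exhausted. */
        ByteBuffer request() {
            ByteBuffer page = free.poll();
            if (page != null) {
                page.clear();
            }
            return page;
        }

        /** Hands a page back to the pool for reuse. */
        void release(ByteBuffer page) {
            free.push(page);
        }

        int available() {
            return free.size();
        }
    }

    /** Serializes one (word, count) record into the page; false means the page is full. */
    static boolean write(ByteBuffer page, String word, int count) {
        byte[] bytes = word.getBytes(StandardCharsets.UTF_8);
        int needed = Integer.BYTES + bytes.length + Integer.BYTES;
        if (page.remaining() < needed) {
            return false;
        }
        page.putInt(bytes.length).put(bytes).putInt(count);
        return true;
    }

    /** Sums the count fields directly on the bytes, without creating record objects. */
    static int sumCounts(ByteBuffer page) {
        ByteBuffer view = page.duplicate(); // independent position, shared content
        view.flip();
        int sum = 0;
        while (view.remaining() > 0) {
            int wordLength = view.getInt();
            view.position(view.position() + wordLength); // skip the word bytes
            sum += view.getInt();
        }
        return sum;
    }
}
```

Because the pages are long-lived and reused, the garbage collector sees a handful of buffers instead of one object per record, which is the intuition behind the reduced GC pressure listed above.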
48. 48
4.3. Memory Management
public class WC {
    public String word;
    public int count;
}
[Figure: JVM heap layout. Labels: Unmanaged (user code objects); Managed (pool of memory pages, with an empty page, used for sorting, hashing, caching); Network buffers (shuffles/broadcasts)]
Flink contains its own memory management stack.
To do that, Flink contains its own type extraction
and serialization components.
49. 49
4.3. Memory Management
Peeking into Apache Flink's Engine Room, by Fabian
Hüske, March 13, 2015.
http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html
Juggling with Bits and Bytes, by Fabian Hüske, May
11, 2015.
https://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html
Memory Management (Batch API), by Stephan Ewen,
May 16, 2015.
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=53741525
Flink added an Off-Heap option for its memory
management component in Flink 0.10:
https://issues.apache.org/jira/browse/FLINK-1320
50. 50
4.3. Memory Management
Compared to Flink, Spark is still behind in custom
memory management but is catching up with its
project Tungsten for Memory Management and
Binary Processing: manage memory explicitly and
eliminate the overhead of the JVM object model and
garbage collection. April 28, 2015.
https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
It seems that Spark is adopting something similar to
Flink and the initial Tungsten announcement read
almost like Flink documentation!!
51. 51
4.4 Optimization
Apache Flink comes with an optimizer that is
independent of the actual programming interface.
It chooses a fitting execution strategy depending on
the inputs and operations.
Example: the "Join" operator will choose between
partitioning and broadcasting the data, as well as
between running a sort-merge-join or a hybrid hash
join algorithm.
This helps you focus on your application logic
rather than parallel execution.
Quick introduction to the Optimizer: section 6 of the
paper ‘The Stratosphere platform for big data
analytics’.
http://stratosphere.eu/assets/papers/2014-VLDBJ_Stratosphere_Overview.pdf
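The kind of decision described for the "Join" operator can be sketched as a toy cost comparison in plain Java. This is not Flink's optimizer and not its cost model: the class `JoinPlanSketch`, both enums and the cost formulas are assumptions made for illustration only. Broadcasting is costed as shipping the smaller input to every parallel worker, repartitioning as shipping both inputs once, and the local algorithm is picked from whether the build side fits in memory or the inputs are already sorted.

```java
/** Illustrative only: a toy cost-based choice of join strategies. */
final class JoinPlanSketch {

    /** How the inputs are shipped between workers. */
    enum Shipping { BROADCAST_SMALLER_SIDE, REPARTITION_BOTH }

    /** Which local join algorithm each worker runs. */
    enum Local { HASH_JOIN, SORT_MERGE_JOIN }

    /** Picks the shipping strategy with the lower (assumed) network cost. */
    static Shipping chooseShipping(long leftBytes, long rightBytes, int parallelism) {
        long smaller = Math.min(leftBytes, rightBytes);
        long broadcastCost = smaller * parallelism;     // every worker receives the smaller side
        long repartitionCost = leftBytes + rightBytes;  // each record crosses the network once
        return broadcastCost < repartitionCost
                ? Shipping.BROADCAST_SMALLER_SIDE
                : Shipping.REPARTITION_BOTH;
    }

    /**
     * Picks the local algorithm. Simplification: a real hybrid hash join can
     * spill to disk, so "does not fit in memory" need not rule it out.
     */
    static Local chooseLocal(long buildSideBytes, long memoryBytes, boolean inputsAlreadySorted) {
        if (inputsAlreadySorted) {
            return Local.SORT_MERGE_JOIN; // reuse the existing sort order
        }
        return buildSideBytes <= memoryBytes ? Local.HASH_JOIN : Local.SORT_MERGE_JOIN;
    }
}
```

With these assumed formulas, joining a 1 KB input with a 10 MB input at parallelism 8 favors broadcasting the small side, while two inputs of roughly 5 MB and 6 MB favor repartitioning both. The same program therefore gets a different plan as the data changes, without the application code being touched.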
52. 52
4.4 Optimization
[Figure: one program, three execution plans (A, B, C) for three situations: run locally on a data sample on the laptop; run on large files on the cluster; run a month later after the data evolved. Optimizer choices shown: hash vs. sort, partition vs. broadcast, caching, reusing partition/sort.]
What is Automatic Optimization? The system's built-in
optimizer takes care of finding the best way to
execute the program in any environment.
53. 53
4.4 Optimization
In contrast to Flink’s built-in automatic optimization,
Spark jobs have to be manually optimized and
adapted to specific datasets because you need to
manually control partitioning and caching if you
want to get it right.
Spark SQL uses the Catalyst optimizer that
supports both rule-based and cost-based
optimization. References:
Spark SQL: Relational Data Processing in Spark.
http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
Deep Dive into Spark SQL’s Catalyst Optimizer.
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
54. 54
4.5. Configuration
Flink requires no memory thresholds to configure:
Flink manages its own memory.
Flink requires no complicated network configurations:
the pipelining engine requires much less memory for
data exchange.
Flink requires no serializers to be configured:
Flink handles its own type extraction and data
representation.
55. 55
4.6. Tuning
According to Mike Olson, Chief Strategy Officer of
Cloudera Inc. “Spark is too knobby — it has too many
tuning parameters, and they need constant
adjustment as workloads, data volumes, user counts
change.” Reference: http://vision.cloudera.com/one-platform/
Tuning Spark Streaming for Throughput By Gerard
Maas from Virdata. December 22, 2014
http://www.virdata.com/tuning-spark/
Spark Tuning:
http://spark.apache.org/docs/latest/tuning.html
56. 56
4.6. Tuning
[Figure: one program, three execution plans (A, B, C) for three situations: run locally on a data sample on the laptop; run on large files on the cluster; run a month later after the data evolved. Optimizer choices shown: hash vs. sort, partition vs. broadcast, caching, reusing partition/sort.]
What is Automatic Optimization? The system's built-in
optimizer takes care of finding the best way to
execute the program in any environment.
57. 57
4.7. Performance
Why does Flink provide better performance?
Custom memory manager
Native closed-loop iteration operators make graph
and machine learning applications run much faster.
Role of the built-in automatic optimizer. For
example: more efficient join processing.
Pipelining data to the next operator in Flink is more
efficient than in Spark.
See benchmarking results against Flink here:
http://www.slideshare.net/sbaltagi/why-apache-flink-is-the-4g-of-big-data-analytics-frameworks/87
58. 58
5. Future work
The framework from Capital One to evaluate stream
processing tools is being refined and will be
published at http://www.capitalone.io/
The assessment of the major open source streaming
tools will be published as well, as a live document
continuously updated by Capital One.
I also have work in progress comparing Spark
and Flink as multi-purpose Big Data analytics
frameworks.
Check my blog at http://www.SparkBigData.com
Check also my slide decks on Flink and Spark at
http://slideshare.net/sbaltagi
59. 59
Agenda
I. Motivation for this talk
II. Apache Flink vs. Apache Spark?
III. How Flink is used at Capital
One?
IV. What are some key takeaways?
60. 60
III. How Flink is used at Capital One?
We started our journey with Apache Flink at Capital
One while researching and contrasting stream
processing tools in the Hadoop ecosystem with a
particular interest in the ones providing real-time
stream processing capabilities and not just micro-
batching as in Apache Spark.
While learning more about Apache Flink, we
discovered some unique capabilities of Flink which
differentiate it from other Big Data analytics tools not
only for Real-Time streaming but also for Batch
processing.
We evaluated Apache Flink Real-Time stream
processing capabilities in a POC.
61. 61
III. How Apache Flink is used at Capital One?
Where are we in our Flink journey?
Successful installation of Apache Flink 0.9 in our
Pre-Production cluster running on CDH 5.4 with
security and High Availability enabled.
Successful installation of Apache Flink 0.9 in a
10-node R&D cluster running HDP.
Successful completion of a Flink POC for real-time
stream processing. The POC proved that a proprietary
system can be replaced by a combination of tools
(Apache Kafka, Apache Flink, Elasticsearch and
Kibana) while adding advanced real-time streaming
analytics.
62. 62
III. How Apache Flink is used at Capital One?
What are the opportunities for using Apache
Flink at Capital One?
1. Real-Time streaming analytics
2. Cascading on Flink
3. Flink’s MapReduce Compatibility Layer
4. Flink’s Storm Compatibility Layer
5. Other Flink libraries (Machine Learning
and Graph processing) once they come
out of beta.
63. 63
III. How Apache Flink is used at Capital One?
Cascading on Flink:
First release of Cascading on Flink was announced
recently by Data Artisans and Concurrent. It will be
supported in upcoming Cascading 3.1.
Capital One is the first company verifying this release
on real-world Cascading data flows with a simple
configuration switch and no code re-work needed!
This is a good example of doing analytics on bounded
data sets (Cascading) using a stream processor (Flink)
Expected advantages: a performance boost and lower
resource consumption.
Future work is to support ‘Driven’ from Concurrent Inc.
to provide performance management for Cascading
data flows running on Flink.
64. 64
III. How Apache Flink is used at Capital One?
Flink’s compatibility layer for Storm:
We can execute existing Storm topologies
using Flink as the underlying engine.
We can reuse our application code (bolts and
spouts) inside Flink programs.
Flink’s libraries (FlinkML for Machine
Learning and Gelly for large-scale graph
processing) can be used alongside Flink’s
DataStream API and DataSet API for our end-to-end
big data analytics needs.
65. 65
Agenda
I. Motivation for this talk
II. Apache Flink vs. Apache Spark?
III. How Flink is used at Capital One?
IV. What are some key takeaways?
66. 66
IV. What are some key takeaways?
Neither Flink nor Spark will be the single analytics
framework that will solve every Big Data problem!
By design, Spark is not for real-time stream processing
while Flink provides a true low latency streaming
engine and advanced DataStream API for real-time
streaming analytics.
Although Spark is ahead in popularity and adoption,
Flink is ahead in technology innovation and is growing
fast.
It is not always the most innovative tool that gets the
largest market share, so the Flink community needs to
take the market dynamics into account!
Both Spark and Flink will have their sweet spots
despite their “Me too syndrome”.
67. 67
Thanks!
• To all of you for attending!
• To Capital One for giving me the
opportunity to meet with the growing
Apache Flink family.
• To the Apache Flink community for the
great spirit of collaboration and help.
• 2016 will be the year of Apache Flink!
• See you at FlinkForward 2016!