Flink Forward San Francisco 2022.
At Stripe we have created a complete end to end exactly-once processing pipeline to process financial data at scale, by combining the exactly-once power from Flink, Kafka, and Pinot together. The pipeline provides exactly-once guarantee, end-to-end latency within a minute, deduplication against hundreds of billions of keys, and sub-second query latency against the whole dataset with trillion level rows. In this session we will discuss the technical challenges of designing, optimizing, and operating the whole pipeline, including Flink, Kafka, and Pinot. We will also share our lessons learned and the benefits gained from exactly-once processing.
by
Xiang Zhang & Pratyush Sharma & Xiaoman Dong
Flink powered stream processing platform at PinterestFlink Forward
Flink Forward San Francisco 2022.
Pinterest is a visual discovery engine that serves over 433MM users. Stream processing allows us to unlock value from realtime data for pinners. At Pinterest, we adopt Flink as the unified streaming processing engine. In this talk, we will share our journey in building a stream processing platform with Flink and how we onboarding critical use cases to the platform. Pinterest has supported 90+near realtime streaming applications. We will cover the problem statement, how we evaluate potential solutions and our decision to build the framework.
by
Rainie Li & Kanchi Masalia
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
Flink Forward San Francisco 2022.
Flink consumers read from Kafka as a scalable, high throughput, and low latency data source. However, there are challenges in scaling out data streams where migration and multiple Kafka clusters are required. Thus, we introduced a new Kafka source to read sharded data across multiple Kafka clusters in a way that conforms well with elastic, dynamic, and reliable infrastructure. In this presentation, we will present the source design and how the solution increases application availability while reducing maintenance toil. Furthermore, we will describe how we extended the existing KafkaSource to provide mechanisms to read logical streams located on multiple clusters, to dynamically adapt to infrastructure changes, and to perform transparent cluster migrations and failover.
by
Mason Chen
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...Flink Forward
Flink Forward San Francisco 2022.
To improve Amazon Alexa experiences and support machine learning inference at scale, we built an automated end-to-end solution for incremental model building or fine-tuning machine learning models through continuous learning, continual learning, and/or semi-supervised active learning. Customer privacy is our top concern at Alexa, and as we build solutions, we face unique challenges when operating at scale such as supporting multiple applications with tens of thousands of transactions per second with several dependencies including near-real time inference endpoints at low latencies. Apache Flink helps us transform and discover metrics in near-real time in our solution. In this talk, we will cover the challenges that we faced, how we scale the infrastructure to meet the needs of ML teams across Alexa, and go into how we enable specific use cases that use Apache Flink on Amazon Kinesis Data Analytics to improve Alexa experiences to delight our customers while preserving their privacy.
by
Aansh Shah
Near real-time statistical modeling and anomaly detection using Flink!Flink Forward
Flink Forward San Francisco 2022.
At ThousandEyes we receive billions of events every day that allow us to monitor the internet; the most important aspect of our platform is to detect outages and anomalies that have a potential to cause serious impact to customer applications and user experience. Automatic detection of such events at lowest latency and highest accuracy is extremely important for our customers and their business. After launching several resilient and low latency data pipelines in production using Flink we decided to take it up a notch; we leveraged Flink to build statistical models in near real-time and apply them on incoming stream of events to detect anomalies! In this session we will deep dive into the design as well as discuss pitfalls and learnings while developing our real-time platform that leverages Debezium, Kafka, Flink, ElasticCache and DynamoDB to process events at scale!
by
Kunal Umrigar & Balint Kurnasz
Building a fully managed stream processing platform on Flink at scale for Lin...Flink Forward
Apache Flink is a distributed stream processing framework that allows users to process and analyze data in real-time. At LinkedIn, we developed a fully managed stream processing platform on Flink running on K8s to power hundreds of stream processing pipelines in production. This platform is the backbone for other infra systems like Search, Espresso (internal document store) and feature management etc. We provide a rich authoring and testing environment which allows users to create, test, and deploy their streaming jobs in a self-serve fashion within minutes. Users can focus on their business logic, leaving the Flink platform to take care of management aspects such as split deployment, resource provisioning, auto-scaling, job monitoring, alerting, failure recovery and much more. In this talk, we will introduce the overall platform architecture, highlight the unique value propositions that it brings to stream processing at LinkedIn and share the experiences and lessons we have learned.
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Flink Forward
Flink Forward San Francisco 2022.
Being in the payments space, Stripe requires strict correctness and freshness guarantees. We rely on Flink as the natural solution for delivering on this in support of our Change Data Capture (CDC) infrastructure. We heavily rely on CDC as a tool for capturing data change streams from our databases without critically impacting database reliability, scalability, and maintainability. Data derived from these streams is used broadly across the business and powers many of our critical financial reporting systems totalling over $640 Billion in payment volume annually. We use many components of Flink’s flexible DataStream API to perform aggregations and abstract away the complexities of stream processing from our downstreams. In this talk, we’ll walk through our experience from the very beginning to what we have in production today. We’ll share stories around the technical details and trade-offs we encountered along the way.
by
Jeff Chao
Flink Forward San Francisco 2022.
Resource Elasticity is a frequently requested feature in Apache Flink: Users want to be able to easily adjust their clusters to changing workloads for resource efficiency and cost saving reasons. In Flink 1.13, the initial implementation of Reactive Mode was introduced, later releases added more improvements to make the feature production ready. In this talk, we’ll explain scenarios to deploy Reactive Mode to various environments to achieve autoscaling and resource elasticity. We’ll discuss the constraints to consider when planning to use this feature, and also potential improvements from the Flink roadmap. For those interested in the internals of Flink, we’ll also briefly explain how the feature is implemented, and if time permits, conclude with a short demo.
by
Robert Metzger
Flink Forward San Francisco 2022.
This talk will take you on the long journey of Apache Flink into the cloud-native era. It started all the way from where Hadoop and YARN were the standard way of deploying and operating data applications.
We're going to deep dive into the cloud-native set of principles and how they map to the Apache Flink internals and recent improvements. We'll cover fast checkpointing, fault tolerance, resource elasticity, minimal infrastructure dependencies, industry-standard tooling, ease of deployment and declarative APIs.
After this talk you'll get a broader understanding of the operational requirements for a modern streaming application and where the current limits are.
by
David Moravek
Flink powered stream processing platform at PinterestFlink Forward
Flink Forward San Francisco 2022.
Pinterest is a visual discovery engine that serves over 433MM users. Stream processing allows us to unlock value from realtime data for pinners. At Pinterest, we adopt Flink as the unified streaming processing engine. In this talk, we will share our journey in building a stream processing platform with Flink and how we onboarding critical use cases to the platform. Pinterest has supported 90+near realtime streaming applications. We will cover the problem statement, how we evaluate potential solutions and our decision to build the framework.
by
Rainie Li & Kanchi Masalia
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
Flink Forward San Francisco 2022.
Flink consumers read from Kafka as a scalable, high throughput, and low latency data source. However, there are challenges in scaling out data streams where migration and multiple Kafka clusters are required. Thus, we introduced a new Kafka source to read sharded data across multiple Kafka clusters in a way that conforms well with elastic, dynamic, and reliable infrastructure. In this presentation, we will present the source design and how the solution increases application availability while reducing maintenance toil. Furthermore, we will describe how we extended the existing KafkaSource to provide mechanisms to read logical streams located on multiple clusters, to dynamically adapt to infrastructure changes, and to perform transparent cluster migrations and failover.
by
Mason Chen
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...Flink Forward
Flink Forward San Francisco 2022.
To improve Amazon Alexa experiences and support machine learning inference at scale, we built an automated end-to-end solution for incremental model building or fine-tuning machine learning models through continuous learning, continual learning, and/or semi-supervised active learning. Customer privacy is our top concern at Alexa, and as we build solutions, we face unique challenges when operating at scale such as supporting multiple applications with tens of thousands of transactions per second with several dependencies including near-real time inference endpoints at low latencies. Apache Flink helps us transform and discover metrics in near-real time in our solution. In this talk, we will cover the challenges that we faced, how we scale the infrastructure to meet the needs of ML teams across Alexa, and go into how we enable specific use cases that use Apache Flink on Amazon Kinesis Data Analytics to improve Alexa experiences to delight our customers while preserving their privacy.
by
Aansh Shah
Near real-time statistical modeling and anomaly detection using Flink!Flink Forward
Flink Forward San Francisco 2022.
At ThousandEyes we receive billions of events every day that allow us to monitor the internet; the most important aspect of our platform is to detect outages and anomalies that have a potential to cause serious impact to customer applications and user experience. Automatic detection of such events at lowest latency and highest accuracy is extremely important for our customers and their business. After launching several resilient and low latency data pipelines in production using Flink we decided to take it up a notch; we leveraged Flink to build statistical models in near real-time and apply them on incoming stream of events to detect anomalies! In this session we will deep dive into the design as well as discuss pitfalls and learnings while developing our real-time platform that leverages Debezium, Kafka, Flink, ElasticCache and DynamoDB to process events at scale!
by
Kunal Umrigar & Balint Kurnasz
Building a fully managed stream processing platform on Flink at scale for Lin...Flink Forward
Apache Flink is a distributed stream processing framework that allows users to process and analyze data in real-time. At LinkedIn, we developed a fully managed stream processing platform on Flink running on K8s to power hundreds of stream processing pipelines in production. This platform is the backbone for other infra systems like Search, Espresso (internal document store) and feature management etc. We provide a rich authoring and testing environment which allows users to create, test, and deploy their streaming jobs in a self-serve fashion within minutes. Users can focus on their business logic, leaving the Flink platform to take care of management aspects such as split deployment, resource provisioning, auto-scaling, job monitoring, alerting, failure recovery and much more. In this talk, we will introduce the overall platform architecture, highlight the unique value propositions that it brings to stream processing at LinkedIn and share the experiences and lessons we have learned.
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Flink Forward
Flink Forward San Francisco 2022.
Being in the payments space, Stripe requires strict correctness and freshness guarantees. We rely on Flink as the natural solution for delivering on this in support of our Change Data Capture (CDC) infrastructure. We heavily rely on CDC as a tool for capturing data change streams from our databases without critically impacting database reliability, scalability, and maintainability. Data derived from these streams is used broadly across the business and powers many of our critical financial reporting systems totalling over $640 Billion in payment volume annually. We use many components of Flink’s flexible DataStream API to perform aggregations and abstract away the complexities of stream processing from our downstreams. In this talk, we’ll walk through our experience from the very beginning to what we have in production today. We’ll share stories around the technical details and trade-offs we encountered along the way.
by
Jeff Chao
Flink Forward San Francisco 2022.
Resource Elasticity is a frequently requested feature in Apache Flink: Users want to be able to easily adjust their clusters to changing workloads for resource efficiency and cost saving reasons. In Flink 1.13, the initial implementation of Reactive Mode was introduced, later releases added more improvements to make the feature production ready. In this talk, we’ll explain scenarios to deploy Reactive Mode to various environments to achieve autoscaling and resource elasticity. We’ll discuss the constraints to consider when planning to use this feature, and also potential improvements from the Flink roadmap. For those interested in the internals of Flink, we’ll also briefly explain how the feature is implemented, and if time permits, conclude with a short demo.
by
Robert Metzger
Flink Forward San Francisco 2022.
This talk will take you on the long journey of Apache Flink into the cloud-native era. It started all the way from where Hadoop and YARN were the standard way of deploying and operating data applications.
We're going to deep dive into the cloud-native set of principles and how they map to the Apache Flink internals and recent improvements. We'll cover fast checkpointing, fault tolerance, resource elasticity, minimal infrastructure dependencies, industry-standard tooling, ease of deployment and declarative APIs.
After this talk you'll get a broader understanding of the operational requirements for a modern streaming application and where the current limits are.
by
David Moravek
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...HostedbyConfluent
Apache Kafka is used as the primary message bus for propagating events and logs across Uber. In particular, it pairs with Apache Pinot, a real-time distributed OLAP datastore, to deliver real-time insights seconds after the messages produced to Kafka.
One challenge we faced was to update existing data in Pinot with the changelog in Kafka, and deliver an accurate view in the real-time analytical results. For example, the financial dashboard can report gross booking with the corrected Ride fares. And restaurant owners can analyze the UberEats orders with their latest delivery status.
Implementing upserts in an immutable real-time OLAP store like Pinot is nontrivial. We need to make architectural changes in how data is distributed via Kafka amongst the server nodes, how it's indexed and queried in a distributed fashion. In this talk I will discuss how we leveraged Kafka's partition-by-key feature to this end and how we added this ability in Pinot without any performance degradation.
Evening out the uneven: dealing with skew in FlinkFlink Forward
Flink Forward San Francisco 2022.
When running Flink jobs, skew is a common problem that results in wasted resources and limited scalability. In the past years, we have helped our customers and users solve various skew-related issues in their Flink jobs or clusters. In this talk, we will present the different types of skew that users often run into: data skew, key skew, event time skew, state skew, and scheduling skew, and discuss solutions for each of them. We hope this will serve as a guideline to help you reduce skew in your Flink environment.
by
Jun Qin & Karl Friedrich
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
Flink Forward San Francisco 2022.
In modern data platform architectures, stream processing engines such as Apache Flink are used to ingest continuous streams of data into data lakes such as Apache Iceberg. Streaming ingestion to iceberg tables can suffer by two problems (1) small files problem that can hurt read performance (2) poor data clustering that can make file pruning less effective. To address those two problems, we propose adding a shuffling stage to the Flink Iceberg streaming writer. The shuffling stage can intelligently group data via bin packing or range partition. This can reduce the number of concurrent files that every task writes. It can also improve data clustering. In this talk, we will explain the motivations in details and dive into the design of the shuffling stage. We will also share the evaluation results that demonstrate the effectiveness of smart shuffling.
by
Gang Ye & Steven Wu
Processing Semantically-Ordered Streams in Financial ServicesFlink Forward
Flink Forward San Francisco 2022.
What if my data is already in order? Stream Processing has given us an elegant and powerful solution for running analytic queries and logic over high volumes of continuously arriving data. However, in both Apache Flink and Apache Beam, the notion of time-ordering is baked in at a very low level, making it difficult to express computations that are interested in a semantic-, rather than time-ordering of the data. In financial services, what often matters the most about the data moving between systems is not when the data was created, but in what order, to the extent that many institutions engineer a global sequencing over all data entering and produced by their systems to achieve complete determinism. How, then, can financial institutions and others best employ Stream Processing on streams of data that are already ordered? I will cover various techniques that can make this work, as well as seek input from the community on how Flink might be improved to better support these use-cases.
by
Patrick Lucas
Using the New Apache Flink Kubernetes Operator in a Production DeploymentFlink Forward
Flink Forward San Francisco 2022.
Running natively on Kubernetes, using the new Apache Flink Kubernetes Operator is a great way to deploy and manage Flink application and session deployments. In this presentation, we provide: - A brief overview of Kubernetes operators and their benefits. - Introduce the five levels of the operator maturity model. - Introduce the newly released Apache Flink Kubernetes Operator and FlinkDeployment CRs - Dockerfile modifications you can make to swap out UBI images and Java of the underlying Flink Operator container - Enhancements we're making in: - Versioning/Upgradeability/Stability - Security - Demo of the Apache Flink Operator in-action, with a technical preview of an upcoming product using the Flink Kubernetes Operator. - Lessons learned - Q&A
by
James Busche & Ted Chang
Producer Performance Tuning for Apache KafkaJiangjie Qin
Kafka is well known for high throughput ingestion. However, to get the best latency characteristics without compromising on throughput and durability, we need to tune Kafka. In this talk, we share our experiences to achieve the optimal combination of latency, throughput and durability for different scenarios.
Kafka Streams State Stores Being Persistentconfluent
Being Persistent: A Look Into Kafka Streams State Stores, Neil Buesing, Principal Solutions Architect, Rill Data
Meetup link: https://www.meetup.com/TwinCities-Apache-Kafka/events/284002062/
Distributed stream processing is evolving from a technology in the sidelines of Big Data to a key enabler for businesses to provide more scalable, real-time services to their customers. We at Ververica, the company founded by the original creators of Apache Flink, and other prominent players in the Flink community have witnessed this development from the driver’s seat. Working with our customer and the wider community we have seen great success stories and we have seen things going wrong. In this talk, I would like to share anecdotes and hard-learned lessons of adopting distributed stream processing – Apache Flink specific as well as across frameworks. Afterwards, you will know, how not to model your use cases as a stream processing application, which data structures not to use, how not to deal with failure, how not to approach the topic of monitoring and much more.
Video: https://www.youtube.com/watch?v=F7HQd3KX2TQ&list=PLDX4T_cnKjD207Aa8b5CsZjc7Z_KRezGz&index=48&t=6s
Stephan Ewen - Experiences running Flink at Very Large ScaleVerverica
This talk shares experiences from deploying and tuning Flink steam processing applications for very large scale. We share lessons learned from users, contributors, and our own experiments about running demanding streaming jobs at scale. The talk will explain what aspects currently render a job as particularly demanding, show how to configure and tune a large scale Flink job, and outline what the Flink community is working on to make the out-of-the-box for experience as smooth as possible. We will, for example, dive into - analyzing and tuning checkpointing - selecting and configuring state backends - understanding common bottlenecks - understanding and configuring network parameters
Dynamic Rule-based Real-time Market Data AlertsFlink Forward
Flink Forward San Francisco 2022.
At Bloomberg, we deal with high volumes of real-time market data. Our clients expect to be notified of any anomalies in this market data, which may indicate volatile movements in the markets, notable trades, forthcoming events, or system failures. The parameters for these alerts are always evolving and our clients can update them dynamically. In this talk, we'll cover how we utilized the open source Apache Flink and Siddhi SQL projects to build a distributed, scalable, low-latency and dynamic rule-based, real-time alerting system to solve our clients' needs. We'll also cover the lessons we learned along our journey.
by
Ajay Vyasapeetam & Madhuri Jain
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
Flink Forward San Francisco 2022.
Probably everyone who has written stateful Apache Flink applications has used one of the fault-tolerant keyed state primitives ValueState, ListState, and MapState. With RocksDB, however, retrieving and updating items comes at an increased cost that you should be aware of. Sometimes, these may not be avoidable with the current API, e.g., for efficient event-time stream-sorting or streaming joins where you need to iterate one or two buffered streams in the right order. With FLIP-220, we are introducing a new state primitive: BinarySortedMultiMapState. This new form of state offers you to (a) efficiently store lists of values for a user-provided key, and (b) iterate keyed state in a well-defined sort order. Both features can be backed efficiently by RocksDB with a 2x performance improvement over the current workarounds. This talk will go into the details of the new API and its implementation, present how to use it in your application, and talk about the process of getting it into Flink.
by
Nico Kruber
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
This presentation introduces Apache Flink, a massively parallel data processing engine which currently undergoes the incubation process at the Apache Software Foundation. Flink's programming primitives are presented and it is shown how easily a distributed PageRank algorithm can be implemented with Flink. Intriguing features such as dedicated memory management, Hadoop compatibility, streaming and automatic optimisation make it an unique system in the world of Big Data processing.
Where is my bottleneck? Performance troubleshooting in FlinkFlink Forward
Flinkn Forward San Francisco 2022.
In this talk, we will cover various topics around performance issues that can arise when running a Flink job and how to troubleshoot them. We’ll start with the basics, like understanding what the job is doing and what backpressure is. Next, we will see how to identify bottlenecks and which tools or metrics can be helpful in the process. Finally, we will also discuss potential performance issues during the checkpointing or recovery process, as well as and some tips and Flink features that can speed up checkpointing and recovery times.
by
Piotr Nowojski
ksqlDB: A Stream-Relational Database Systemconfluent
Speaker: Matthias J. Sax, Software Engineer, Confluent
ksqlDB is a distributed event streaming database system that allows users to express SQL queries over relational tables and event streams. The project was released by Confluent in 2017 and is hosted on Github and developed with an open-source spirit. ksqlDB is built on top of Apache Kafka®, a distributed event streaming platform. In this talk, we discuss ksqlDB’s architecture that is influenced by Apache Kafka and its stream processing library, Kafka Streams. We explain how ksqlDB executes continuous queries while achieving fault tolerance and high vailability. Furthermore, we explore ksqlDB’s streaming SQL dialect and the different types of supported queries.
Matthias J. Sax is a software engineer at Confluent working on ksqlDB. He mainly contributes to Kafka Streams, Apache Kafka's stream processing library, which serves as ksqlDB's execution engine. Furthermore, he helps evolve ksqlDB's "streaming SQL" language. In the past, Matthias also contributed to Apache Flink and Apache Storm and he is an Apache committer and PMC member. Matthias holds a Ph.D. from Humboldt University of Berlin, where he studied distributed data stream processing systems.
https://db.cs.cmu.edu/events/quarantine-db-talk-2020-confluent-ksqldb-a-stream-relational-database-system/
Flink at netflix paypal speaker seriesMonal Daxini
* Over 100 million subscribers from over 190 countries enjoy the Netflix service. This leads to over a trillion events flowing through the Keystone stream processing infrastructure to help glean business insights and improve customer experience. The self-serve infrastructure enables the users to focus on extracting insights, and not worry about building out scalable infrastructure. I’ll share our experience building building this platform with Flink, and lessons learnt.
Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your re...Flink Forward
Failures are inevitable. How can we recover a Flink job from outage? How do we reprocess data from outage period? What are the implications to downstream consumers? These are important questions that we need to answer when running Flink for critical data processing applications. We implemented two solutions for our stream processing platform: (1) use data warehouse, like Hive, as backfill source (2) rewind Flink job using external checkpoint. We will describe both solutions in details, and discuss the pros and cons of each approach. We will also take a look at some of the caveats to watch out for.
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...HostedbyConfluent
Apache Kafka is used as the primary message bus for propagating events and logs across Uber. In particular, it pairs with Apache Pinot, a real-time distributed OLAP datastore, to deliver real-time insights seconds after the messages produced to Kafka.
One challenge we faced was to update existing data in Pinot with the changelog in Kafka, and deliver an accurate view in the real-time analytical results. For example, the financial dashboard can report gross booking with the corrected Ride fares. And restaurant owners can analyze the UberEats orders with their latest delivery status.
Implementing upserts in an immutable real-time OLAP store like Pinot is nontrivial. We need to make architectural changes in how data is distributed via Kafka amongst the server nodes, how it's indexed and queried in a distributed fashion. In this talk I will discuss how we leveraged Kafka's partition-by-key feature to this end and how we added this ability in Pinot without any performance degradation.
Evening out the uneven: dealing with skew in FlinkFlink Forward
Flink Forward San Francisco 2022.
When running Flink jobs, skew is a common problem that results in wasted resources and limited scalability. In the past years, we have helped our customers and users solve various skew-related issues in their Flink jobs or clusters. In this talk, we will present the different types of skew that users often run into: data skew, key skew, event time skew, state skew, and scheduling skew, and discuss solutions for each of them. We hope this will serve as a guideline to help you reduce skew in your Flink environment.
by
Jun Qin & Karl Friedrich
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
Flink Forward San Francisco 2022.
In modern data platform architectures, stream processing engines such as Apache Flink are used to ingest continuous streams of data into data lakes such as Apache Iceberg. Streaming ingestion to iceberg tables can suffer by two problems (1) small files problem that can hurt read performance (2) poor data clustering that can make file pruning less effective. To address those two problems, we propose adding a shuffling stage to the Flink Iceberg streaming writer. The shuffling stage can intelligently group data via bin packing or range partition. This can reduce the number of concurrent files that every task writes. It can also improve data clustering. In this talk, we will explain the motivations in details and dive into the design of the shuffling stage. We will also share the evaluation results that demonstrate the effectiveness of smart shuffling.
by
Gang Ye & Steven Wu
Processing Semantically-Ordered Streams in Financial ServicesFlink Forward
Flink Forward San Francisco 2022.
What if my data is already in order? Stream Processing has given us an elegant and powerful solution for running analytic queries and logic over high volumes of continuously arriving data. However, in both Apache Flink and Apache Beam, the notion of time-ordering is baked in at a very low level, making it difficult to express computations that are interested in a semantic-, rather than time-ordering of the data. In financial services, what often matters the most about the data moving between systems is not when the data was created, but in what order, to the extent that many institutions engineer a global sequencing over all data entering and produced by their systems to achieve complete determinism. How, then, can financial institutions and others best employ Stream Processing on streams of data that are already ordered? I will cover various techniques that can make this work, as well as seek input from the community on how Flink might be improved to better support these use-cases.
by
Patrick Lucas
Using the New Apache Flink Kubernetes Operator in a Production DeploymentFlink Forward
Flink Forward San Francisco 2022.
Running natively on Kubernetes, using the new Apache Flink Kubernetes Operator is a great way to deploy and manage Flink application and session deployments. In this presentation, we provide: - A brief overview of Kubernetes operators and their benefits. - Introduce the five levels of the operator maturity model. - Introduce the newly released Apache Flink Kubernetes Operator and FlinkDeployment CRs - Dockerfile modifications you can make to swap out UBI images and Java of the underlying Flink Operator container - Enhancements we're making in: - Versioning/Upgradeability/Stability - Security - Demo of the Apache Flink Operator in-action, with a technical preview of an upcoming product using the Flink Kubernetes Operator. - Lessons learned - Q&A
by
James Busche & Ted Chang
Producer Performance Tuning for Apache KafkaJiangjie Qin
Kafka is well known for high throughput ingestion. However, to get the best latency characteristics without compromising on throughput and durability, we need to tune Kafka. In this talk, we share our experiences to achieve the optimal combination of latency, throughput and durability for different scenarios.
Kafka Streams State Stores Being Persistentconfluent
Being Persistent: A Look Into Kafka Streams State Stores, Neil Buesing, Principal Solutions Architect, Rill Data
Meetup link: https://www.meetup.com/TwinCities-Apache-Kafka/events/284002062/
Distributed stream processing is evolving from a technology in the sidelines of Big Data to a key enabler for businesses to provide more scalable, real-time services to their customers. We at Ververica, the company founded by the original creators of Apache Flink, and other prominent players in the Flink community have witnessed this development from the driver’s seat. Working with our customer and the wider community we have seen great success stories and we have seen things going wrong. In this talk, I would like to share anecdotes and hard-learned lessons of adopting distributed stream processing – Apache Flink specific as well as across frameworks. Afterwards, you will know, how not to model your use cases as a stream processing application, which data structures not to use, how not to deal with failure, how not to approach the topic of monitoring and much more.
Video: https://www.youtube.com/watch?v=F7HQd3KX2TQ&list=PLDX4T_cnKjD207Aa8b5CsZjc7Z_KRezGz&index=48&t=6s
Stephan Ewen - Experiences running Flink at Very Large ScaleVerverica
This talk shares experiences from deploying and tuning Flink steam processing applications for very large scale. We share lessons learned from users, contributors, and our own experiments about running demanding streaming jobs at scale. The talk will explain what aspects currently render a job as particularly demanding, show how to configure and tune a large scale Flink job, and outline what the Flink community is working on to make the out-of-the-box for experience as smooth as possible. We will, for example, dive into - analyzing and tuning checkpointing - selecting and configuring state backends - understanding common bottlenecks - understanding and configuring network parameters
Dynamic Rule-based Real-time Market Data AlertsFlink Forward
Flink Forward San Francisco 2022.
At Bloomberg, we deal with high volumes of real-time market data. Our clients expect to be notified of any anomalies in this market data, which may indicate volatile movements in the markets, notable trades, forthcoming events, or system failures. The parameters for these alerts are always evolving and our clients can update them dynamically. In this talk, we'll cover how we utilized the open source Apache Flink and Siddhi SQL projects to build a distributed, scalable, low-latency and dynamic rule-based, real-time alerting system to solve our clients' needs. We'll also cover the lessons we learned along our journey.
by
Ajay Vyasapeetam & Madhuri Jain
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
Flink Forward San Francisco 2022.
Probably everyone who has written stateful Apache Flink applications has used one of the fault-tolerant keyed state primitives ValueState, ListState, and MapState. With RocksDB, however, retrieving and updating items comes at an increased cost that you should be aware of. Sometimes, these may not be avoidable with the current API, e.g., for efficient event-time stream-sorting or streaming joins where you need to iterate one or two buffered streams in the right order. With FLIP-220, we are introducing a new state primitive: BinarySortedMultiMapState. This new form of state offers you to (a) efficiently store lists of values for a user-provided key, and (b) iterate keyed state in a well-defined sort order. Both features can be backed efficiently by RocksDB with a 2x performance improvement over the current workarounds. This talk will go into the details of the new API and its implementation, present how to use it in your application, and talk about the process of getting it into Flink.
by
Nico Kruber
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
This presentation introduces Apache Flink, a massively parallel data processing engine which currently undergoes the incubation process at the Apache Software Foundation. Flink's programming primitives are presented and it is shown how easily a distributed PageRank algorithm can be implemented with Flink. Intriguing features such as dedicated memory management, Hadoop compatibility, streaming and automatic optimisation make it an unique system in the world of Big Data processing.
Where is my bottleneck? Performance troubleshooting in FlinkFlink Forward
Flinkn Forward San Francisco 2022.
In this talk, we will cover various topics around performance issues that can arise when running a Flink job and how to troubleshoot them. We’ll start with the basics, like understanding what the job is doing and what backpressure is. Next, we will see how to identify bottlenecks and which tools or metrics can be helpful in the process. Finally, we will also discuss potential performance issues during the checkpointing or recovery process, as well as and some tips and Flink features that can speed up checkpointing and recovery times.
by
Piotr Nowojski
ksqlDB: A Stream-Relational Database Systemconfluent
Speaker: Matthias J. Sax, Software Engineer, Confluent
ksqlDB is a distributed event streaming database system that allows users to express SQL queries over relational tables and event streams. The project was released by Confluent in 2017 and is hosted on Github and developed with an open-source spirit. ksqlDB is built on top of Apache Kafka®, a distributed event streaming platform. In this talk, we discuss ksqlDB’s architecture that is influenced by Apache Kafka and its stream processing library, Kafka Streams. We explain how ksqlDB executes continuous queries while achieving fault tolerance and high vailability. Furthermore, we explore ksqlDB’s streaming SQL dialect and the different types of supported queries.
Matthias J. Sax is a software engineer at Confluent working on ksqlDB. He mainly contributes to Kafka Streams, Apache Kafka's stream processing library, which serves as ksqlDB's execution engine. Furthermore, he helps evolve ksqlDB's "streaming SQL" language. In the past, Matthias also contributed to Apache Flink and Apache Storm and he is an Apache committer and PMC member. Matthias holds a Ph.D. from Humboldt University of Berlin, where he studied distributed data stream processing systems.
https://db.cs.cmu.edu/events/quarantine-db-talk-2020-confluent-ksqldb-a-stream-relational-database-system/
Flink at netflix paypal speaker seriesMonal Daxini
* Over 100 million subscribers from over 190 countries enjoy the Netflix service. This leads to over a trillion events flowing through the Keystone stream processing infrastructure to help glean business insights and improve customer experience. The self-serve infrastructure enables the users to focus on extracting insights, and not worry about building out scalable infrastructure. I’ll share our experience building building this platform with Flink, and lessons learnt.
Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your re...Flink Forward
Failures are inevitable. How can we recover a Flink job from outage? How do we reprocess data from outage period? What are the implications to downstream consumers? These are important questions that we need to answer when running Flink for critical data processing applications. We implemented two solutions for our stream processing platform: (1) use data warehouse, like Hive, as backfill source (2) rewind Flink job using external checkpoint. We will describe both solutions in details, and discuss the pros and cons of each approach. We will also take a look at some of the caveats to watch out for.
The need for gleaning answers from unbounded data streams is moving from nicety to a necessity. Netflix is a data driven company, and has a need to process over 1 trillion events a day amounting to 3 PB of data to derive business insights.
To ease extracting insight, we are building a self-serve, scalable, fault-tolerant, multi-tenant "Stream Processing as a Service" platform so the user can focus on data analysis. I'll share our experience using Flink to help build the platform.
When Streaming Needs Batch With Konstantin Knauf | Current 2022HostedbyConfluent
When Streaming Needs Batch With Konstantin Knauf | Current 2022
A streaming application is started once and then continuously ingests endless, fairly steady streams of events. That's as far as the theory goes.
Unfortunately, reality is more complicated. Over time your application's ability to process large historical data sets robustly, efficiently and correctly will be critical:
- for exploratory data analysis during development
- for bootstrapping the initial state of an application
- for back-filling following an outage or bugfix
- for keeping up with bursty input streams
These scenarios call for batch processing techniques. Apache Flink is as streaming-first as it gets. Yet over the last releases, the community has invested significant resources into unifying stream- and batch processing on all layers of the stack: scheduler to APIs.
In this talk, I'll introduce Apache Flink's approach to unified stream and batch processing and discuss - by example - how these scenarios can already be addressed today and what might be possible in the future.
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data ArtisansEvention
This talk will start with brief introduction to streaming processing and Flink itself. Next, we will take a look at some of the most interesting recent improvements in Flink such as incremental checkpointing,
end-to-end exactly-once processing guarantee and network latency optimizations. We’ll discuss real problems that Flink’s users were facing and how they were addressed by the community and dataArtisans.
Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messagesLINE Corporation
Yuto Kawamura
LINE / Z Part Team
At LINE we've been operating Apache Kafka to provide the company-wide shared data pipeline for services using it for storing and distributing data.
Kafka is underlying many of our services in some way, not only the messaging service but also AD, Blockchain, Pay, Timeline, Cryptocurrency trading and more.
Many services feeding many data into our cluster, leading over 250 billion daily messages and 3.5GB incoming bytes in 1 second which is one of the world largest scale.
At the same time, it is required to be stable and performant all the time because many important services uses it as a backend.
In this talk I will introduce the overview of Kafka usage at LINE and how we're operating it.
I'm also going to talk about some engineerings we did for maximizing its performance, solving troubles led particularly by hosting huge data from many services, leveraging advanced techniques like kernel-level dynamic tracing.
State Management in Apache Flink : Consistent Stateful Distributed Stream Pro...Paris Carbone
An overview of state management techniques employed in Apache Flink including pipelined consistent snapshots and intuitive usages for reconfiguration, which were presented at vldb 2017.
2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo...Ververica
Learn how the combination of Apache Kafka and Apache Flink is making stateful stream processing even more expressive and flexible to support applications in streaming that were previously not considered streamable.
The new world of applications and fast data architectures has broken up the database: Raw data persistence comes in the form of event logs, and the state of the world is computed by a stream processor. Apache Kafka provides a strong solution for the event log, while Apache Flink forms a powerful foundation for the computation over the event streams.
In this talk we discuss how Flink’s abstraction and management of application state have evolved over time and how Flink’s snapshot persistence model and Kafka’s log work together to form a base to build ‘versioned applications’. We will also show how end-to-end exactly-once processing works through a smart integration of Kafka’s transactions and Flink’s checkpointing mechanism.
Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...Flink Forward
This talk shares experiences from deploying and tuning Flink steam processing applications for very large scale. We share lessons learned from users, contributors, and our own experiments about running demanding streaming jobs at scale. The talk will explain what aspects currently render a job as particularly demanding, show how to configure and tune a large scale Flink job, and outline what the Flink community is working on to make the out-of-the-box for experience as smooth as possible. We will, for example, dive into - analyzing and tuning checkpointing - selecting and configuring state backends - understanding common bottlenecks - understanding and configuring network parameters
Apache Flink(tm) - A Next-Generation Stream ProcessorAljoscha Krettek
In diesem Vortrag wird es zunächst einen kurzen Überblick über den aktuellen Stand im Bereich der Streaming-Datenanalyse geben. Danach wird es mit einer kleinen Einführung in das Apache-Flink-System zur Echtzeit-Datenanalyse weitergehen, bevor wir tiefer in einige der interessanten Eigenschaften eintauchen werden, die Flink von den anderen Spielern in diesem Bereich unterscheidet. Dazu werden wir beispielhafte Anwendungsfälle betrachten, die entweder direkt von Nutzern stammen oder auf unserer Erfahrung mit Nutzern basieren. Spezielle Eigenschaften, die wir betrachten werden, sind beispielsweise die Unterstützung für die Zerlegung von Events in einzelnen Sessions basierend auf der Zeit, zu der ein Ereignis passierte (event-time), Bestimmung von Zeitpunkten zum jeweiligen Speichern des Zustands eines Streaming-Programms für spätere Neustarts, die effiziente Abwicklung bei sehr großen zustandsorientierten Streaming-Berechnungen und die Zugänglichkeit des Zustandes von außerhalb.
Towards Apache Flink 2.0 - Unified Data Processing and Beyond, Bowen LiBowen Li
This talk was presented at Scale by the bay conference on Nov 14, 2019.
As the most popular and widely adopted stream processing framework, Apache Flink powers some of the world's largest stream processing use cases in companies like Netflix, Alibaba, Uber, Lyft, Pinterest, Yelp , etc.
In this talk, we will first go over use cases and basic (yet hard to achieve!) requirements of stream processing, and how Flink stands out with some of its unique core building blocks, like pipelined execution, native event time support, state support, and fault tolerance.
We will then take a look at how Flink is going beyond stream processing into areas like unified streaming/batch data processing, enterprise intergration with Hive, AI/machine learning, and serverless computation, how Flink fits with its distinct value, and what development is going on in Flink community to gap.
Uses the example of correct, high-througput, grouping and counting of streaming events as a backdrop for exploring the state-of-the art features of Apache Flink
Stream processing with Apache Flink - Maximilian Michels Data ArtisansEvention
Apache Flink is an open source platform for distributed stream and batch data processing. At its core, Flink is a streaming dataflow engine which provides data distribution, communication, and fault tolerance for distributed computations over data streams. On top of this core, APIs make it easy to develop distributed data analysis programs. Libraries for graph processing or machine learning provide convenient abstractions for solving large-scale problems. Apache Flink integrates with a multitude of other open source systems like Hadoop, databases, or message queues. Its streaming capabilities make it a perfect fit for traditional batch processing as well as state of the art stream processing.
Streaming in Practice - Putting Apache Kafka in Productionconfluent
This presentation focuses on how to integrate all these components into an enterprise environment and what things you need to consider as you move into production.
We will touch on the following topics:
- Patterns for integrating with existing data systems and applications
- Metadata management at enterprise scale
- Tradeoffs in performance, cost, availability and fault tolerance
- Choosing which cross-datacenter replication patterns fit with your application
- Considerations for operating Kafka-based data pipelines in production
SFBigAnalytics_20190724: Monitor kafka like a ProChester Chen
Kafka operators need to provide guarantees to the business that Kafka is working properly and delivering data in real time, and they need to identify and triage problems so they can solve them before end users notice them. This elevates the importance of Kafka monitoring from a nice-to-have to an operational necessity. In this talk, Kafka operations experts Xavier Léauté and Gwen Shapira share their best practices for monitoring Kafka and the streams of events flowing through it. How to detect duplicates, catch buggy clients, and triage performance issues – in short, how to keep the business’s central nervous system healthy and humming along, all like a Kafka pro.
Speakers: Gwen Shapira, Xavier Leaute (Confluence)
Gwen is a software engineer at Confluent working on core Apache Kafka. She has 15 years of experience working with code and customers to build scalable data architectures. She currently specializes in building real-time reliable data processing pipelines using Apache Kafka. Gwen is an author of “Kafka - the Definitive Guide”, "Hadoop Application Architectures", and a frequent presenter at industry conferences. Gwen is also a committer on the Apache Kafka and Apache Sqoop projects.
Xavier Leaute is One of the first engineers to Confluent team, Xavier is responsible for analytics infrastructure, including real-time analytics in KafkaStreams. He was previously a quantitative researcher at BlackRock. Prior to that, he held various research and analytics roles at Barclays Global Investors and MSCI.
Why Serverless Flink Matters - Blazing Fast Stream Processing Made ScalableHostedbyConfluent
"The shift from batch processing to real-time processing of data is accelerating. Building real-time data applications is a necessity for many businesses as customers expect data to be always up-to-date and their apps to react to changes as they happen. However building and productizing real-time applications is often a complex and lengthy process due to limited serverless options to build such apps.
The introduction of AWS lambdas was a watershed moment in the world of cloud computing. It allowed developers to fire up “fully-managed” computer programs while paying for only when the program ran. Serverless compute comes with three big advantages - improved scalability, reduced cost, and increased flexibility. We’re bringing this same powerful paradigm to real time data processing with Flink in Confluent Cloud. Using this model, users can focus on writing business logic instead of managing nodes and other infrastructure.
Attendees will learn the benefits of serverless and see how it fits into the context of stream processing. We’ll then kick off a demo where we’ll focus on a real world production use case that uses Flink jobs to power an application with extremely low latency."
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...HostedbyConfluent
At Stripe, we operate a general ledger modeled as double-entry bookkeeping for all financial transactions. Warehousing such data is challenging due to its high volume and high cardinality of unique accounts.
aFurthermore, it is financially critical to get up-to-date, accurate analytics over all records. Due to the changing nature of real time transactions, it is impossible to pre-compute the analytics as a fixed time series. We have overcome the challenge by creating a real time key-value store inside Pinot that can sustain half million QPS with all the financial transactions.
We will talk about the details of our solution and the interesting technical challenges faced.
Similar to Exactly-Once Financial Data Processing at Scale with Flink and Pinot (20)
Introducing the Apache Flink Kubernetes OperatorFlink Forward
Flink Forward San Francisco 2022.
The Apache Flink Kubernetes Operator provides a consistent approach to manage Flink applications automatically, without any human interaction, by extending the Kubernetes API. Given the increasing adoption of Kubernetes based Flink deployments the community has been working on a Kubernetes native solution as part of Flink that can benefit from the rich experience of community members and ultimately make Flink easier to adopt. In this talk we give a technical introduction to the Flink Kubernetes Operator and demonstrate the core features and use-cases through in-depth examples."
by
Thomas Weise
One sink to rule them all: Introducing the new Async SinkFlink Forward
Flink Forward San Francisco 2022.
Next time you want to integrate with a new destination for a demo, concept or production application, the Async Sink framework will bootstrap development, allowing you to move quickly without compromise. In Flink 1.15 we introduced the Async Sink base (FLIP-171), with the goal to encapsulate common logic and allow developers to focus on the key integration code. The new framework handles things like request batching, buffering records, applying backpressure, retry strategies, and at least once semantics. It allows you to focus on your business logic, rather than spending time integrating with your downstream consumers. During the session we will dive deep into the internals to uncover how it works, why it was designed this way, and how to use it. We will code up a new sink from scratch and demonstrate how to quickly push data to a destination. At the end of this talk you will be ready to start implementing your own Flink sink using the new Async Sink framework.
by
Steffen Hausmann & Danny Cranmer
Tuning Apache Kafka Connectors for Flink.pptxFlink Forward
Flink Forward San Francisco 2022.
In normal situations, the default Kafka consumer and producer configuration options work well. But we all know life is not all roses and rainbows and in this session we’ll explore a few knobs that can save the day in atypical scenarios. First, we'll take a detailed look at the parameters available when reading from Kafka. We’ll inspect the params helping us to spot quickly an application lock or crash, the ones that can significantly improve the performance and the ones to touch with gloves since they could cause more harm than benefit. Moreover we’ll explore the partitioning options and discuss when diverging from the default strategy is needed. Next, we’ll discuss the Kafka Sink. After browsing the available options we'll then dive deep into understanding how to approach use cases like sinking enormous records, managing spikes, and handling small but frequent updates.. If you want to understand how to make your application survive when the sky is dark, this session is for you!
by
Olena Babenko
Flink Forward San Francisco 2022.
The Table API is one of the most actively developed components of Flink in recent time. Inspired by databases and SQL, it encapsulates concepts many developers are familiar with. It can be used with both bounded and unbounded streams in a unified way. But from afar it can be difficult to keep track of what this API is capable of and how it relates to Flink's other APIs. In this talk, we will explore the current state of Table API. We will show how it can be used as a batch processor, a changelog processor, or a streaming ETL tool with many built-in functions and operators for deduplicating, joining, and aggregating data. By comparing it to the DataStream API we will highlight differences and elaborate on when to use which API. We will demonstrate hybrid pipelines in which both APIs interact with one another and contribute their unique strengths. Finally, we will take a look at some of the most recent additions as a first step to stateful upgrades.
by
David Andreson
Flink Forward San Francisco 2022.
Based on the new Flink-Pulsar connector, we implemented Flink's TableAPI and Catalog to help users to interact with the Pulsar cluster via Flink SQL easily. We would like to go through the design and implementation of the SQL connector in the following aspects:
1. Two different modes of use Pulsar as a metadata store
2. Data format transformation and management
3. SQL semantics support within Pulsar context
by
Sijie Guo & Neng Lu
Batch Processing at Scale with Flink & IcebergFlink Forward
Flink Forward San Francisco 2022.
Goldman Sachs's Data Lake platform serves as the firm's centralized data platform, ingesting 140K (and growing!) batches per day of Datasets of varying shape and size. Powered by Flink and using metadata configured by platform users, ingestion applications are generated dynamically at runtime to extract, transform, and load data into centralized storage where it is then exported to warehousing solutions such as Sybase IQ, Snowflake, and Amazon Redshift. Data Latency is one of many key considerations as producers and consumers have their own commitments to satisfy. Consumers range from people/systems issuing queries, to applications using engines like Spark, Hive, and Presto to transform data into refined Datasets. Apache Iceberg allows our applications to not only benefit from consistency guarantees important when running on eventually consistent storage like S3, but also allows us the opportunity to improve our batch processing patterns with its scalability-focused features.
by
Andreas Hailu
Flink Forward San Francisco 2022.
At Flink Forward, we get to hear creative, unique use cases, often on the bleeding edge of some of the most exciting current technologies. This talk will give you a chance to get to open up the hood on our driven and innovative Open Source community. I will cover what our community has been working on this past year, and how this work relates to our (Ververica's) exciting new Flink engineering roadmap! I will also go through some best practices and upcoming opportunities for getting involved in this community!
by
Caito Scherr
Practical learnings from running thousands of Flink jobsFlink Forward
Flink Forward San Francisco 2022.
Task Managers constantly running out of memory? Flink job keeps restarting from cryptic Akka exceptions? Flink job running but doesn’t seem to be processing any records? We share practical learnings from running thousands of Flink Jobs for different use-cases and take a look at common challenges they have experienced such as out-of-memory errors, timeouts and job stability. We will cover memory tuning, S3 and Akka configurations to address common pitfalls and the approaches that we take on automating health monitoring and management of Flink jobs at scale.
by
Hong Teoh & Usamah Jassat
Extending Flink SQL for stream processing use casesFlink Forward
Flink Forward San Francisco 2022.
Apache Flink is a powerful stream processing platform that enables users to build complex real time applications. Flink SQL provides a SQL interface that implements standard SQL. While the standard SQL provides a perfect interface for batch processing, in stream processing context, it can result is ambiguity and complex syntax. As an example, consider these three types of streams: Append-only stream, Retract stream and Upsert stream. Using standard SQL, we would represent all of these streams as Table along with the Table concept in batch processing. Such overloading of concepts can result in ambiguity in SQL statements in streaming context. In this talk, we will present extensions to the Flink SQL that simplify SQL statements in the context of stream processing. We will show how such extensions work in the context of a Flink application using different use cases. These extensions are only sugar syntax and users should be able to use Flink SQL as is if they desire.
by
Hojjat Jafarpour
The top 3 challenges running multi-tenant Flink at scaleFlink Forward
Apache Flink is the foundation for Decodable's real-time SaaS data platform. Flink runs critical data processing jobs with strong security requirements. In addition, Decodable has to scale to thousands of tenants, power various use cases, provide an intuitive user experience and maintain cost-efficiency. We've learned a lot of lessons while building and maintaining the platform. In this talk, I'll share the top 3 toughest challenges building and operating this platform with Flink, and how we solved them.
Using Queryable State for Fun and ProfitFlink Forward
Flink Forward San Francisco 2022.
A particular feature in our system relies on a streaming 90-minute trailing window of 1-minute samples - implemented as a lookaside cache - to speed up a particular query, allowing our customers to rapidly see an overview of their estate. Across our entire customer base, there is a substantial amount of data flowing into this cache - ~1,000,000 entries/second, with the entire cache requiring ~600GB of RAM. The current implementation is simplistic but expensive. In this talk I describe a replacement implementation as a stateful streaming Flink application leveraging Queryable State. This Flink application reduces the net cost by ~90%. In this session, the implementation is described in detail, including windowing considerations, a sliding-window state buffer that avoids the sliding window replication penalty, and a comparison of queryable state and Redis queries. The talk concludes with a frank discussion of when this distinctive approach is, and is not, appropriate.
by
Ron Crocker
Changelog Stream Processing with Apache FlinkFlink Forward
Flink Forward San Francisco 2022.
The world is constantly changing. Data is continuously produced and thus should be consumed in a similar fashion by enterprise systems. Only this enables real-time decisions at scale. Message logs such as Apache Kafka can be found in almost every architecture, while databases and other batch systems still provide the foundation. Change Data Capture (CDC) propagates changes downstream. In this talk, we will highlight what it means to be a general data processor and how Flink can act as an integration hub. We present the current state of Flink and how it can power various use cases on both finite and infinite streams. We demonstrate Flink's SQL engine as a changelog processor that is shipped with an ecosystem tailored to process CDC data and maintain materialized views. We will use Kafka as an upsert log, Debezium for connecting to databases, and enrich streams of various sources. Finally, we will combine Flink's Table API with DataStream API for event-driven applications beyond SQL.
by
Timo Walther
Large Scale Real Time Fraudulent Web Behavior DetectionFlink Forward
Flink Forward San Francisco 2022.
Neuro-ID analyzes web behavior at a large scale to determine visitors' intent on web pages, specifically in the online lending industry. When users interact with an online loan application, our software analyzes their behavior to determine if the applicant may be potentially fraudulent. Lenders can then request various scores describing the applicant's intentions in real-time to use to make decisions during the application flow. Flink gives our product the ability to observe behavior in a stateful manner. As an applicant interacts with an online loan application, a Flink application is used to compare earlier actions to later actions. This processing in Flink can determine the applicant's intent throughout the process of the application.
by
Jeff Niemann & Randy Hanak
Building Reliable Lakehouses with Apache Flink and Delta LakeFlink Forward
Flink Forward San Francisco 2022.
Apache Flink and Delta Lake together allow you to build the foundation for your data lakehouses by ensuring the reliability of your concurrent streams from processing to the underlying cloud object-store. Together, the Flink/Delta Connector enables you to store data in Delta tables such that you harness Delta’s reliability by providing ACID transactions and scalability while maintaining Flink’s end-to-end exactly-once processing. This ensures that the data from Flink is written to Delta Tables in an idempotent manner such that even if the Flink pipeline is restarted from its checkpoint information, the pipeline will guarantee no data is lost or duplicated thus preserving the exactly-once semantics of Flink.
by
Scott Sandre & Denny Lee
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
Flink Forward San Francisco 2022.
With a real-time processing engine like Flink and a transactional storage layer like Hudi, it has never been easier to build end-to-end low-latency data platforms connecting sources like Kafka to data lake storage. Come learn how to blend Lakehouse architectural patterns with real-time processing pipelines with Flink and Hudi. We will dive deep on how Flink can leverage the newest features of Hudi like multi-modal indexing that dramatically improves query and write performance, data skipping that reduces the query latency by 10x for large datasets, and many more innovations unique to Flink and Hudi.
by
Ethan Guo & Kyle Weller
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
3. Agenda
Problem: near real-time end-to-end exactly once processing pipeline at scale
The architecture: Kafka, Flink, Pinot and how to connect all together
Operational challenges and learnings
3
1
2
3
4. The problem to solve—Ledger dataset
Ledger is a data set that Stripe
maintains to record all money
movements
4
5. Requirements for the Ledger pipeline
Near real-time processing to meet SLO targets (p99 in orders of minutes; p90 < 1 minute)
5
1
6. Requirements for the Ledger pipeline
Near real-time processing to meet SLO targets (p99 in orders of minutes p90 < 1 minute)
Be able to process events at scale
6
1
2
7. Requirements for the Ledger pipeline
Near real-time processing to meet SLO targets (p99 in orders of minutes p90 < 1 minute)
Be able to process events at scale
No missing transactions: a single transaction can be of millions of dollars
7
1
2
3
8. Requirements for the Ledger pipeline
Near real-time processing to meet SLO targets (p99 in orders of minutes p90 < 1 minute)
Be able to process events at scale
No missing transactions: a single transaction can be of millions of dollars
No duplicate transactions across the entire history:
● Duplicates are inevitable on the source sides (deployments, restarts, accidental
duplicate job executions etc.)
8
1
2
3
4
9. Requirements for the Ledger pipeline
Near real-time processing to meet SLO targets (p99 in orders of minutes p90 < 1 minute)
Be able to process events at scale
No missing transactions: a single transaction can be of millions of dollars
No duplicate transactions across the entire history
9
1
2
3
4
Near real-time end-to-end exactly-once processing at scale!
10. Agenda
Problem: near real-time end-to-end exactly once processing pipeline at scale
The architecture: Kafka, Flink, Pinot and how to connect all together
Operational challenges and learnings
10
1
2
3
19. Pinot Exactly Once Ingestion (2/5)
19
● Pinot table rows are stored in
immutable chunks/batches called
segments
● Real time segments being indexed are
mutable. Once they are full they will be
“sealed” and become immutable. New
mutable segments will be created to
continue indexing.
20. Pinot Exactly Once Ingestion (3/5)
20
We can consider Pinot’s latest segment as one
database transaction:
● Transaction begins at segment creation
● Transaction is committed when “sealed”
● Kafka offset stored atomically along with Pinot
segment metadata
● If any exception happens, the whole
transaction (segment) is aborted and restarted
21. Pinot Exactly Once Ingestion (4/5)
21
{
"segment.crc": "3251475672",
"segment.creation.time": "1648231912328",
"segment.download.url":
"s3://some/table/mytable__8__0__20220325T1811Z",
"segment.end.time": "1388707200000",
"segment.flush.threshold.size": "4166",
"segment.index.version": "v3",
"segment.realtime.endOffset": "10264",
"segment.realtime.numReplicas": "2",
"segment.realtime.startOffset": "10240",
"segment.realtime.status": "DONE",
"segment.start.time": "1388707200000",
"segment.time.unit": "MILLISECONDS",
"segment.total.docs": "24"
}
● Each segment has one single Zookeeper
node storing its metadata
● Kafka Offsets are stored inside segment
metadata
● Atomicity
○ Zookeeper node update is atomic
○ Kafka offset is updated at the same
time segment status updates
(“DONE”)
22. Pinot Exactly Once Ingestion (5/5)
22
If Pinot server is restarted or crashed
● Whole segment is discarded
● Segment recreated starting from the
next position of offset from Segment_0
● Kafka consumer seek() is called
23. Agenda
Problem: near real-time end-to-end exactly once processing pipeline at scale
The architecture: Kafka, Flink, Pinot and how to connect all together
Operational challenges and learnings
23
1
2
3
24. Caveats of exactly-once - nothing is free!
Exactly-once is not bulletproof. Data loss or duplicate can still happen.
24
1
25. Caveats of exactly-once - nothing is free!
Exactly-once is not bulletproof. Data loss or duplicate can still happen.
It might give users a false sense of security.
25
1
2
26. Caveats of exactly-once - nothing is free!
Exactly-once is not bulletproof. Data loss or duplicate can still happen.
It might give users a false sense of security.
Hard to add additional layers to the architecture due to transactional guarantee.
26
1
2
3
27. Caveats of exactly-once - nothing is free!
Exactly-once is not bulletproof. Data loss or duplicate can still happen.
It might give users a false sense of security.
Hard to add additional layers to the architecture due to transactional guarantee.
Latency and SLO is impacted by checkpoint intervals.
27
1
2
3
4
28. Potential data loss in two-phase commit
28
Source: https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html
29. Potential data loss in two-phase commit
29
Source: https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html
The transaction can be expired in Kafka!
30. Optimizing large state hydration at recovery time
The ledger deduplicator app maintains tens of terabytes of states to do all time
deduplication
30
1
31. Optimizing large state hydration at recovery time
The ledger deduplicator app maintains tens of terabytes of states to do all time deduplication
Task local recovery doesn’t work with multiple disks mounted (FLINK-10954)
● Need to hydrate the entire state everytime the job is rescheduled (job failure, host
failure/restarts/recycle)
● Impacts end to end latency
31
1
2
32. Optimizing large state hydration at recovery time
The ledger deduplicator app maintains tens of terabytes of states to do all time deduplication
Task local recovery doesn’t work with multiple disks mounted (FLINK-10954)
● Need to hydrate the entire state everytime the job is rescheduled (job failure, host
failure/restarts/recycle)
● Impacts end to end latency
Even if we make local recovery work, Stripe recycles hosts periodically
● The pipeline is as slow as the slowest host to recover state
32
1
2
3
33. Optimizing large state hydration at recovery time
● Task parallelism increase to the rescue: the more threads, the faster to download the state and
to rebuild the local state DB!
33
○ Increasing parallelism requires state
redistribution.
○ Flink uses the concept of key-group
as an atomic unit for state
distribution.
34. Parallelism Increase from 180 to 270 doesn’t work
34
Each task gets 2 key groups assigned
1. 180 tasks have 1 key group
2. 90 tasks have 2 key groups
35. What we want is even distribution
35
Each task gets 2 key groups assigned Each task gets 1 key group assigned
36. Monitoring Large State Size
● Flink can report native RocksDB metrics.
● State backend latency tracking metrics can help debugging.
● Large pending Rocks DB compactions can affect performance.
36
37. Linux OOM Kills Causing Job Restarts
● Flink < 1.12 uses glibc to allocate memory, which leads to memory fragmentation.
● Combined with large states required by the deduplicator app, it consistently causes OOM.
● With large number of task managers and time it takes to rehydrate state, it impacts latency SLO.
37
38. jemalloc Everywhere
● Flink switched to jemalloc for its default memory allocator in Docker images in Flink 1.12.
38
Pre jemalloc Post jemalloc
39. Data Quality Monitoring
● Pinot is an analytics platform that runs SQL blazingly fast, so…
○ Duplicate detection:
■ SELECT primary_key, count(*) as cnt FROM mytable
GROUP BY primary_key HAVING cnt > 1
■ Run query in REALTIME only to help query performance by using special table name
like mytable_REALTIME
○ Missing entry detection:
■ Bucket rows by time and count by bucket
■ JOIN/Compare to source of truth (upstream metric in Data Warehouse)
39
40. How to repair data in Pinot?
● If some range of data are corrupted (contains duplicate)
○ Find the duplicated data by SQL query.
○ Delete and rebuild the Pinot segments containing duplicates.
○ Pinot virtual column names like $segmentName helps locating segments.
● Best Practices
○ A reliable exactly-once Kafka Archive (backup) will come in handy in a fire.
○ Build stable/reliable timestamp into primary key, use that timestamp as Pinot timestamp.
40
41. Lessons Learned
● Flink
○ Set a Kafka transaction timeout large enough to account for any job downtime.
○ Set a parallelism to a number such that max parallelism is divisible by this number.
○ Use jemalloc in Flink.
● Pinot
○ Higher Kafka transaction frequency and shorter Flink checkpoint intervals will improve end
to end data freshness in Pinot.
○ Beware of bogus message counts: Many Kafka internal metrics include messages of failed
transactions.
○ Duplicate monitoring is a must for critical apps.
41
Editor's Notes
Transitional slides from talking about deduplicator to pinot ingestion.
Transitional slides from talking about deduplicator to pinot ingestion.
Transitional slides from talking about deduplicator to pinot ingestion.
Transitional slides from talking about deduplicator to pinot ingestion.
Transitional slides from talking about deduplicator to pinot ingestion.