Flink Forward San Francisco 2022.
Being in the payments space, Stripe requires strict correctness and freshness guarantees. We rely on Flink as the natural solution for delivering on this in support of our Change Data Capture (CDC) infrastructure. We heavily rely on CDC as a tool for capturing data change streams from our databases without critically impacting database reliability, scalability, and maintainability. Data derived from these streams is used broadly across the business and powers many of our critical financial reporting systems totalling over $640 Billion in payment volume annually. We use many components of Flink’s flexible DataStream API to perform aggregations and abstract away the complexities of stream processing from our downstreams. In this talk, we’ll walk through our experience from the very beginning to what we have in production today. We’ll share stories around the technical details and trade-offs we encountered along the way.
by
Jeff Chao
A Thorough Comparison of Delta Lake, Iceberg and Hudi (Databricks)
Recently, a set of modern table formats such as Delta Lake, Hudi, and Iceberg have sprung up. Along with Hive Metastore, these table formats are trying to solve long-standing problems in traditional data lakes with their declared features such as ACID, schema evolution, upsert, time travel, and incremental consumption.
Flink powered stream processing platform at Pinterest (Flink Forward)
Flink Forward San Francisco 2022.
Pinterest is a visual discovery engine that serves over 433MM users. Stream processing allows us to unlock value from realtime data for pinners. At Pinterest, we adopt Flink as the unified stream processing engine. In this talk, we will share our journey in building a stream processing platform with Flink and how we onboard critical use cases to the platform. Pinterest has supported 90+ near-realtime streaming applications. We will cover the problem statement, how we evaluate potential solutions, and our decision to build the framework.
by
Rainie Li & Kanchi Masalia
Near real-time statistical modeling and anomaly detection using Flink! (Flink Forward)
Flink Forward San Francisco 2022.
At ThousandEyes we receive billions of events every day that allow us to monitor the internet; the most important aspect of our platform is detecting outages and anomalies that have the potential to seriously impact customer applications and user experience. Automatic detection of such events at the lowest latency and highest accuracy is extremely important for our customers and their business. After launching several resilient, low-latency data pipelines in production using Flink, we decided to take it up a notch: we leveraged Flink to build statistical models in near real-time and apply them to the incoming stream of events to detect anomalies! In this session we will deep dive into the design and discuss pitfalls and learnings from developing our real-time platform, which leverages Debezium, Kafka, Flink, ElastiCache and DynamoDB to process events at scale!
by
Kunal Umrigar & Balint Kurnasz
Exactly-Once Financial Data Processing at Scale with Flink and Pinot (Flink Forward)
Flink Forward San Francisco 2022.
At Stripe we have created a complete end-to-end exactly-once processing pipeline to process financial data at scale by combining the exactly-once capabilities of Flink, Kafka, and Pinot. The pipeline provides an exactly-once guarantee, end-to-end latency within a minute, deduplication against hundreds of billions of keys, and sub-second query latency against the whole dataset, which holds rows at the trillion level. In this session we will discuss the technical challenges of designing, optimizing, and operating the whole pipeline, including Flink, Kafka, and Pinot. We will also share our lessons learned and the benefits gained from exactly-once processing.
by
Xiang Zhang & Pratyush Sharma & Xiaoman Dong
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i... (Flink Forward)
Flink Forward San Francisco 2022.
To improve Amazon Alexa experiences and support machine learning inference at scale, we built an automated end-to-end solution for incremental model building or fine-tuning machine learning models through continuous learning, continual learning, and/or semi-supervised active learning. Customer privacy is our top concern at Alexa, and as we build solutions, we face unique challenges when operating at scale such as supporting multiple applications with tens of thousands of transactions per second with several dependencies including near-real time inference endpoints at low latencies. Apache Flink helps us transform and discover metrics in near-real time in our solution. In this talk, we will cover the challenges that we faced, how we scale the infrastructure to meet the needs of ML teams across Alexa, and go into how we enable specific use cases that use Apache Flink on Amazon Kinesis Data Analytics to improve Alexa experiences to delight our customers while preserving their privacy.
by
Aansh Shah
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang (Databricks)
As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements.
1) Generality: support reading/writing most data management/storage systems.
2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities.
Data Source API V2 is one of the most important features coming with Spark 2.3. This talk will dive into the design and implementation of Data Source API V2, with a comparison to Data Source API V1. We also demonstrate how to implement a file-based data source using Data Source API V2 to show its generality and flexibility.
Building a Real-Time Analytics Application with Apache Pulsar and Apache Pinot (Altinity Ltd)
While the demand for real-time analytics is growing by leaps and bounds, analytics software must rely on streaming platforms to ingest high volumes of data traveling at lightning speed down the pipeline. We will take a look at 2 powerful open source Apache platforms, Pulsar and Pinot, that work hand in hand to deliver analytical results which bring great value to your systems.
Presenters: Mary Grygleski - Streaming Developer Advocate &
Mark Needham - Developer Relations Engineer at StarTree
Note: This webinar will be recorded and later posted on our Webinar page (https://altinity.com/webinarspage/) or Altinity official Youtube channel (https://www.youtube.com/@Altinity).
Processing Semantically-Ordered Streams in Financial Services (Flink Forward)
Flink Forward San Francisco 2022.
What if my data is already in order? Stream Processing has given us an elegant and powerful solution for running analytic queries and logic over high volumes of continuously arriving data. However, in both Apache Flink and Apache Beam, the notion of time-ordering is baked in at a very low level, making it difficult to express computations that are interested in a semantic-, rather than time-ordering of the data. In financial services, what often matters the most about the data moving between systems is not when the data was created, but in what order, to the extent that many institutions engineer a global sequencing over all data entering and produced by their systems to achieve complete determinism. How, then, can financial institutions and others best employ Stream Processing on streams of data that are already ordered? I will cover various techniques that can make this work, as well as seek input from the community on how Flink might be improved to better support these use-cases.
by
Patrick Lucas
Building Reliable Lakehouses with Apache Flink and Delta Lake (Flink Forward)
Flink Forward San Francisco 2022.
Apache Flink and Delta Lake together allow you to build the foundation for your data lakehouses by ensuring the reliability of your concurrent streams from processing to the underlying cloud object-store. Together, the Flink/Delta Connector enables you to store data in Delta tables such that you harness Delta’s reliability by providing ACID transactions and scalability while maintaining Flink’s end-to-end exactly-once processing. This ensures that the data from Flink is written to Delta Tables in an idempotent manner such that even if the Flink pipeline is restarted from its checkpoint information, the pipeline will guarantee no data is lost or duplicated thus preserving the exactly-once semantics of Flink.
by
Scott Sandre & Denny Lee
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli... (Flink Forward)
Flink Forward San Francisco 2022.
Flink consumers read from Kafka as a scalable, high throughput, and low latency data source. However, there are challenges in scaling out data streams where migration and multiple Kafka clusters are required. Thus, we introduced a new Kafka source to read sharded data across multiple Kafka clusters in a way that conforms well with elastic, dynamic, and reliable infrastructure. In this presentation, we will present the source design and how the solution increases application availability while reducing maintenance toil. Furthermore, we will describe how we extended the existing KafkaSource to provide mechanisms to read logical streams located on multiple clusters, to dynamically adapt to infrastructure changes, and to perform transparent cluster migrations and failover.
by
Mason Chen
Lambda architecture is a popular technique where records are processed by a batch system and streaming system in parallel. The results are then combined during query time to provide a complete answer. Strict latency requirements to process old and recently generated events made this architecture popular. The key downside to this architecture is the development and operational overhead of managing two different systems.
There have been attempts in the past to unify batch and streaming into a single system, but organizations have not been very successful in those attempts. With the advent of Delta Lake, however, we are seeing a lot of engineers adopting a simple continuous data flow model to process data as it arrives. We call this architecture the Delta Architecture.
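To make the query-time combination in a Lambda architecture concrete, here is a minimal sketch in plain Python. It is not taken from the talk: the event shape `(timestamp, key)`, the `cutoff` boundary, and the function names are all assumptions chosen for illustration. The batch view covers events before the cutoff, the speed view covers events from the cutoff onward, and a query adds the two.

```python
from collections import Counter

def batch_view(events, cutoff):
    """Hypothetical batch layer: counts per key for events older than the cutoff."""
    return Counter(key for ts, key in events if ts < cutoff)

def speed_view(events, cutoff):
    """Hypothetical speed layer: counts per key for events at or after the cutoff."""
    return Counter(key for ts, key in events if ts >= cutoff)

def query(key, batch, speed):
    """Serving layer: merge both views at query time to get a complete answer."""
    return batch[key] + speed[key]

# Illustrative events as (timestamp, key); the values are made up for the example.
events = [(1, "a"), (2, "b"), (3, "a"), (10, "a"), (11, "b")]
cutoff = 10

batch = batch_view(events, cutoff)
speed = speed_view(events, cutoff)
total_a = query("a", batch, speed)
total_b = query("b", batch, speed)
```

Even in this toy form, the counting logic exists twice, once per layer, which is the development and operational overhead described above.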
Evening out the uneven: dealing with skew in Flink (Flink Forward)
Flink Forward San Francisco 2022.
When running Flink jobs, skew is a common problem that results in wasted resources and limited scalability. In the past years, we have helped our customers and users solve various skew-related issues in their Flink jobs or clusters. In this talk, we will present the different types of skew that users often run into: data skew, key skew, event time skew, state skew, and scheduling skew, and discuss solutions for each of them. We hope this will serve as a guideline to help you reduce skew in your Flink environment.
by
Jun Qin & Karl Friedrich
Flink Forward San Francisco 2022.
Resource elasticity is a frequently requested feature in Apache Flink: users want to be able to easily adjust their clusters to changing workloads for resource efficiency and cost saving reasons. The initial implementation of Reactive Mode was introduced in Flink 1.13; later releases added more improvements to make the feature production ready. In this talk, we’ll explain scenarios for deploying Reactive Mode in various environments to achieve autoscaling and resource elasticity. We’ll discuss the constraints to consider when planning to use this feature, as well as potential improvements from the Flink roadmap. For those interested in the internals of Flink, we’ll also briefly explain how the feature is implemented and, if time permits, conclude with a short demo.
by
Robert Metzger
Building a fully managed stream processing platform on Flink at scale for Lin... (Flink Forward)
Apache Flink is a distributed stream processing framework that allows users to process and analyze data in real-time. At LinkedIn, we developed a fully managed stream processing platform on Flink running on K8s to power hundreds of stream processing pipelines in production. This platform is the backbone for other infra systems like Search, Espresso (internal document store) and feature management etc. We provide a rich authoring and testing environment which allows users to create, test, and deploy their streaming jobs in a self-serve fashion within minutes. Users can focus on their business logic, leaving the Flink platform to take care of management aspects such as split deployment, resource provisioning, auto-scaling, job monitoring, alerting, failure recovery and much more. In this talk, we will introduce the overall platform architecture, highlight the unique value propositions that it brings to stream processing at LinkedIn and share the experiences and lessons we have learned.
ksqlDB is a stream processing SQL engine that allows stream processing on top of Apache Kafka. ksqlDB is based on Kafka Streams and provides capabilities for consuming messages from Kafka, analysing these messages in near-realtime with a SQL-like language, and producing results back to a Kafka topic. As a result, not a single line of Java code has to be written and you can reuse your SQL know-how. This lowers the bar for starting with stream processing significantly.
ksqlDB offers powerful stream processing capabilities, such as joins, aggregations, time windows and support for event time. In this talk I will present how KSQL integrates with the Kafka ecosystem and demonstrate how easy it is, for the most part, to implement a solution using ksqlDB. This will be done in a live demo on a fictitious IoT sample.
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ... (Flink Forward)
Flink Forward San Francisco 2022.
Probably everyone who has written stateful Apache Flink applications has used one of the fault-tolerant keyed state primitives ValueState, ListState, and MapState. With RocksDB, however, retrieving and updating items comes at an increased cost that you should be aware of. Sometimes, these costs cannot be avoided with the current API, e.g., for efficient event-time stream-sorting or streaming joins where you need to iterate one or two buffered streams in the right order. With FLIP-220, we are introducing a new state primitive: BinarySortedMultiMapState. This new form of state lets you (a) efficiently store lists of values for a user-provided key, and (b) iterate keyed state in a well-defined sort order. Both features can be backed efficiently by RocksDB with a 2x performance improvement over the current workarounds. This talk will go into the details of the new API and its implementation, present how to use it in your application, and talk about the process of getting it into Flink.
by
Nico Kruber
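The two properties named in the abstract, lists of values per user-provided key and iteration in a well-defined sort order, can be illustrated with a small in-memory sketch. This is plain Python and not the Flink API: the class `SortedMultiMap` and its methods are hypothetical names, and the actual interface proposed in FLIP-220 is not reproduced here. The example buffers out-of-order events by timestamp and releases them in order, as an event-time sorter might.

```python
import bisect

class SortedMultiMap:
    """Hypothetical in-memory stand-in: several values per key, iterated in key order."""

    def __init__(self):
        self._keys = []    # distinct keys, kept sorted
        self._values = {}  # key -> list of values in insertion order

    def add(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)  # keep the key list sorted
            self._values[key] = []
        self._values[key].append(value)

    def items(self):
        """Yield (key, value) pairs in ascending key order."""
        for key in self._keys:
            for value in self._values[key]:
                yield key, value

    def pop_up_to(self, bound):
        """Remove and return all pairs with key <= bound, in ascending key order."""
        idx = bisect.bisect_right(self._keys, bound)
        ready, self._keys = self._keys[:idx], self._keys[idx:]
        return [(key, value) for key in ready for value in self._values.pop(key)]

# Events arrive out of order, keyed by an illustrative timestamp.
buffer = SortedMultiMap()
buffer.add(30, "c")
buffer.add(10, "a")
buffer.add(20, "b")
buffer.add(10, "a2")

all_in_order = list(buffer.items())
released = buffer.pop_up_to(20)   # e.g. everything at or before a watermark of 20
remaining = list(buffer.items())
```

The sketch only shows the access pattern; it says nothing about how the state is serialized or stored in RocksDB.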
Where is my bottleneck? Performance troubleshooting in Flink (Flink Forward)
Flink Forward San Francisco 2022.
In this talk, we will cover various topics around performance issues that can arise when running a Flink job and how to troubleshoot them. We’ll start with the basics, like understanding what the job is doing and what backpressure is. Next, we will see how to identify bottlenecks and which tools or metrics can be helpful in the process. Finally, we will also discuss potential performance issues during the checkpointing or recovery process, as well as some tips and Flink features that can speed up checkpointing and recovery times.
by
Piotr Nowojski
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021 (StreamNative)
You may be familiar with the Presto plugin, which is used to run fast interactive queries over Pulsar using ANSI SQL and lets Pulsar data be joined with other data sources. This plugin will soon be renamed to align with the rename of the PrestoSQL project to Trino. What is the purpose of this rename, and what does it mean for those using the Presto plugin? We cover the history of the community shift from PrestoDB to PrestoSQL, as well as the future plans for the Pulsar community to donate this plugin to the Trino project. One of the connector maintainers will then demo the connector and show what is possible when using Trino and Pulsar!
ksqlDB: A Stream-Relational Database System (confluent)
Speaker: Matthias J. Sax, Software Engineer, Confluent
ksqlDB is a distributed event streaming database system that allows users to express SQL queries over relational tables and event streams. The project was released by Confluent in 2017 and is hosted on GitHub and developed with an open-source spirit. ksqlDB is built on top of Apache Kafka®, a distributed event streaming platform. In this talk, we discuss ksqlDB’s architecture, which is influenced by Apache Kafka and its stream processing library, Kafka Streams. We explain how ksqlDB executes continuous queries while achieving fault tolerance and high availability. Furthermore, we explore ksqlDB’s streaming SQL dialect and the different types of supported queries.
Matthias J. Sax is a software engineer at Confluent working on ksqlDB. He mainly contributes to Kafka Streams, Apache Kafka's stream processing library, which serves as ksqlDB's execution engine. Furthermore, he helps evolve ksqlDB's "streaming SQL" language. In the past, Matthias also contributed to Apache Flink and Apache Storm and he is an Apache committer and PMC member. Matthias holds a Ph.D. from Humboldt University of Berlin, where he studied distributed data stream processing systems.
https://db.cs.cmu.edu/events/quarantine-db-talk-2020-confluent-ksqldb-a-stream-relational-database-system/
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache (Dremio Corporation)
From DataEngConf 2017 - Everybody wants to get to data faster. As we move from more general solutions to specific optimization techniques, the level of performance impact grows. This talk will discuss how layering in-memory caching, columnar storage and relational caching can combine to provide a substantial improvement in overall data science and analytical workloads. It will include a detailed overview of how you can use Apache Arrow, Calcite and Parquet to achieve an improvement of multiple orders of magnitude in performance over what is currently possible.
The Parquet Format and Performance Optimization Opportunities (Databricks)
The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads.
As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and Parquet in general.
This talk serves both as an approachable refresher on columnar storage as well as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
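As a rough illustration of predicate pushdown with min/max skipping, the sketch below keeps per-row-group statistics and skips any group whose maximum cannot satisfy a `value > threshold` filter. It is a toy model in plain Python, not the Parquet reader API: `RowGroup`, `make_row_group`, and `scan_greater_than` are hypothetical names, and real Parquet statistics are stored per column chunk and page in the file metadata.

```python
from dataclasses import dataclass

@dataclass
class RowGroup:
    """Hypothetical row group: the values of one column plus precomputed statistics."""
    values: list
    min_value: int
    max_value: int

def make_row_group(values):
    # Statistics are computed once at write time and stored next to the data.
    return RowGroup(values, min(values), max(values))

def scan_greater_than(row_groups, threshold):
    """Return values > threshold and the number of row groups actually read."""
    matches = []
    scanned = 0
    for group in row_groups:
        if group.max_value <= threshold:
            continue  # skip: no value in this group can exceed the threshold
        scanned += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, scanned

# Illustrative data: three row groups over a roughly sorted column.
row_groups = [make_row_group(vs) for vs in ([1, 5, 9], [10, 14, 19], [20, 25, 30])]
matches, scanned = scan_greater_than(row_groups, 18)
```

Skipping is most effective when the data is sorted or clustered on the filtered column, since the min/max ranges of the row groups then overlap less.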
Apache Flink is a popular framework for real-time stream computing. Many stream compute algorithms require trailing data in order to compute the intended result. One example is computing the number of user logins in the last 7 days. This creates a dilemma where the results of the stream program are incomplete until the runtime of the program exceeds 7 days. The alternative is to bootstrap the program using historic data to seed the state before shifting to use real-time data.
This talk will discuss alternatives to bootstrap programs in Flink. Some alternatives rely on technologies exogenous to the stream program, such as enhancements to the pub/sub layer, that are more generally applicable to other stream compute engines. Other alternatives include enhancements to Flink source implementations. Lyft is exploring another alternative using orchestration of multiple Flink programs. The talk will cover why Lyft pursued this alternative and future directions to further enhance bootstrapping support in Flink.
Speaker
Gregory Fee, Principal Engineer, Lyft
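The 7-day login example from the abstract can be sketched in a few lines of plain Python to show why bootstrapping matters. This is not Flink code and not Lyft's approach: `LoginCounter`, `bootstrap`, and the timestamps are assumptions made for illustration, and the counter tracks a single user. A counter started cold only sees live events, while a counter seeded from historic data gives a complete answer immediately.

```python
from collections import deque

DAY = 24 * 60 * 60
WINDOW_SECONDS = 7 * DAY

class LoginCounter:
    """Hypothetical trailing-window state: login timestamps for one user."""

    def __init__(self):
        self._timestamps = deque()  # assumed to be appended in ascending order

    def add(self, ts):
        self._timestamps.append(ts)

    def count(self, now):
        # Evict logins that fell out of the trailing 7-day window, then count.
        while self._timestamps and self._timestamps[0] <= now - WINDOW_SECONDS:
            self._timestamps.popleft()
        return len(self._timestamps)

def bootstrap(counter, historic_logins):
    """Seed the state from historic data before switching to live events."""
    for ts in sorted(historic_logins):
        counter.add(ts)

historic = [1 * DAY, 3 * DAY, 5 * DAY]  # logins recorded before the job started
live_login = 6 * DAY                    # first login seen by the running job

cold = LoginCounter()
cold.add(live_login)
cold_count = cold.count(now=6 * DAY)    # incomplete: history is missing

warm = LoginCounter()
bootstrap(warm, historic)
warm.add(live_login)
warm_count = warm.count(now=6 * DAY)    # complete from the first live event
later_count = warm.count(now=9 * DAY)   # the login at day 1 has expired by now
```

The cold counter would only catch up after running for a full 7 days, which is the dilemma the abstract describes.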
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...confluent
RocksDB is the default state store for Kafka Streams. In this talk, we will discuss how to improve single node performance of the state store by tuning RocksDB and how to efficiently identify issues in the setup. We start with a short description of the RocksDB architecture. We discuss how Kafka Streams restores the state stores from Kafka by leveraging RocksDB features for bulk loading of data. We give examples of hand-tuning the RocksDB state stores based on Kafka Streams metrics and RocksDB’s metrics. At the end, we dive into a few RocksDB command line utilities that allow you to debug your setup and dump data from a state store. We illustrate the usage of the utilities with a few real-life use cases. The key takeaway from the session is the ability to understand the internal details of the default state store in Kafka Streams so that engineers can fine-tune their performance for different varieties of workloads and operate the state stores in a more robust manner.
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture (Kai Wähner)
Apache Kafka in conjunction with Apache Spark became the de facto standard for processing and analyzing data. Both frameworks are open, flexible, and scalable.
Unfortunately, the latter makes operations a challenge for many teams. Ideally, teams can use serverless SaaS offerings to focus on business logic. However, hybrid and multi-cloud scenarios require a cloud-native platform that provides automated and elastic tooling to reduce the operations burden.
This session explores different architectures to build serverless Apache Kafka and Apache Spark multi-cloud architectures across regions and continents.
We start from the analytics perspective of a data lake and explore its relation to a fully integrated data streaming layer with Kafka to build a modern Data Lakehouse.
Real-world use cases show the joint value and explore the benefit of the "delta lake" integration.
Real-time Analytics with Trino and Apache Pinot (Xiang Fu)
Trino summit 2021:
Overview of Trino Pinot Connector, which bridges the flexibility of Trino's full SQL support to the power of Apache Pinot's realtime analytics, giving you the best of both worlds.
Batch Processing at Scale with Flink & Iceberg (Flink Forward)
Flink Forward San Francisco 2022.
Goldman Sachs's Data Lake platform serves as the firm's centralized data platform, ingesting 140K (and growing!) batches per day of Datasets of varying shape and size. Powered by Flink and using metadata configured by platform users, ingestion applications are generated dynamically at runtime to extract, transform, and load data into centralized storage where it is then exported to warehousing solutions such as Sybase IQ, Snowflake, and Amazon Redshift. Data Latency is one of many key considerations as producers and consumers have their own commitments to satisfy. Consumers range from people/systems issuing queries, to applications using engines like Spark, Hive, and Presto to transform data into refined Datasets. Apache Iceberg allows our applications to not only benefit from consistency guarantees important when running on eventually consistent storage like S3, but also allows us the opportunity to improve our batch processing patterns with its scalability-focused features.
by
Andreas Hailu
Making Apache Spark Better with Delta Lake (Databricks)
Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions and scalable metadata handling, and unifies streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
In this talk, we will cover:
* What data quality problems Delta helps address
* How to convert your existing application to Delta Lake
* How the Delta Lake transaction protocol works internally
* The Delta Lake roadmap for the next few releases
* How to get involved!
Building Reliable Lakehouses with Apache Flink and Delta LakeFlink Forward
Flink Forward San Francisco 2022.
Apache Flink and Delta Lake together allow you to build the foundation for your data lakehouses by ensuring the reliability of your concurrent streams from processing to the underlying cloud object-store. Together, the Flink/Delta Connector enables you to store data in Delta tables such that you harness Delta’s reliability by providing ACID transactions and scalability while maintaining Flink’s end-to-end exactly-once processing. This ensures that the data from Flink is written to Delta Tables in an idempotent manner such that even if the Flink pipeline is restarted from its checkpoint information, the pipeline will guarantee no data is lost or duplicated thus preserving the exactly-once semantics of Flink.
by
Scott Sandre & Denny Lee
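The idempotent-write idea in the abstract above can be illustrated with a small in-memory sketch. This is not the Flink/Delta Connector's code: the `Table` class, `commit` method, and `app_id`/`checkpoint_id` parameters are hypothetical names chosen for illustration, on the assumption that each commit is tagged with the writer's identity and a monotonically increasing checkpoint number.

```python
class Table:
    """Toy table that accepts each (app_id, checkpoint_id) commit at most once."""

    def __init__(self):
        self.rows = []
        self.committed = {}  # app_id -> highest checkpoint id already committed

    def commit(self, app_id, checkpoint_id, rows):
        # A replayed commit (same or older checkpoint) is ignored, so a
        # pipeline restarted from a checkpoint cannot duplicate data.
        if checkpoint_id <= self.committed.get(app_id, -1):
            return False
        self.rows.extend(rows)
        self.committed[app_id] = checkpoint_id
        return True


table = Table()
first = table.commit("job-1", 1, ["a", "b"])
replay = table.commit("job-1", 1, ["a", "b"])  # simulated restart replays checkpoint 1
second = table.commit("job-1", 2, ["c"])
```

In this sketch the replayed commit returns `False` and leaves the rows untouched, which is the property the abstract describes as preserving exactly-once semantics.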
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
Flink Forward San Francisco 2022.
Flink consumers read from Kafka as a scalable, high throughput, and low latency data source. However, there are challenges in scaling out data streams where migration and multiple Kafka clusters are required. Thus, we introduced a new Kafka source to read sharded data across multiple Kafka clusters in a way that conforms well with elastic, dynamic, and reliable infrastructure. In this presentation, we will present the source design and how the solution increases application availability while reducing maintenance toil. Furthermore, we will describe how we extended the existing KafkaSource to provide mechanisms to read logical streams located on multiple clusters, to dynamically adapt to infrastructure changes, and to perform transparent cluster migrations and failover.
by
Mason Chen
Lambda architecture is a popular technique where records are processed by a batch system and streaming system in parallel. The results are then combined during query time to provide a complete answer. Strict latency requirements to process old and recently generated events made this architecture popular. The key downside to this architecture is the development and operational overhead of managing two different systems.
There have been attempts in the past to unify batch and streaming into a single system, but organizations have not been very successful in those attempts. With the advent of Delta Lake, however, we are seeing a lot of engineers adopting a simple continuous data flow model to process data as it arrives. We call this architecture the Delta Architecture.
Evening out the uneven: dealing with skew in FlinkFlink Forward
Flink Forward San Francisco 2022.
When running Flink jobs, skew is a common problem that results in wasted resources and limited scalability. In the past years, we have helped our customers and users solve various skew-related issues in their Flink jobs or clusters. In this talk, we will present the different types of skew that users often run into: data skew, key skew, event time skew, state skew, and scheduling skew, and discuss solutions for each of them. We hope this will serve as a guideline to help you reduce skew in your Flink environment.
by
Jun Qin & Karl Friedrich
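The abstract does not say which solutions the talk proposes. One widely used remedy for key skew is key salting with two-phase aggregation, sketched below in plain Python rather than Flink. The function names, the number of salts, and the sample data are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def salted_key(key, num_salts, rng):
    """Attach a random salt so one hot key spreads over num_salts sub-keys."""
    return (key, rng.randrange(num_salts))


def two_phase_count(events, num_salts=4, seed=0):
    rng = random.Random(seed)
    # Phase 1: partial counts per (key, salt); each pair could be
    # processed by a different parallel worker.
    partial = Counter(salted_key(key, num_salts, rng) for key in events)
    # Phase 2: drop the salt and merge the partial counts per original key.
    total = defaultdict(int)
    for (key, _salt), count in partial.items():
        total[key] += count
    return dict(total), partial


events = ["hot"] * 1000 + ["cold"] * 10
totals, partials = two_phase_count(events)
```

The final totals are the same as an unsalted count, but no single sub-key has to absorb all of the hot key's events.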
Flink Forward San Francisco 2022.
Resource elasticity is a frequently requested feature in Apache Flink: users want to be able to easily adjust their clusters to changing workloads for resource efficiency and cost-saving reasons. Flink 1.13 introduced the initial implementation of Reactive Mode, and later releases added more improvements to make the feature production ready. In this talk, we’ll explain scenarios to deploy Reactive Mode to various environments to achieve autoscaling and resource elasticity. We’ll discuss the constraints to consider when planning to use this feature, and also potential improvements from the Flink roadmap. For those interested in the internals of Flink, we’ll also briefly explain how the feature is implemented, and if time permits, conclude with a short demo.
by
Robert Metzger
Building a fully managed stream processing platform on Flink at scale for Lin...Flink Forward
Apache Flink is a distributed stream processing framework that allows users to process and analyze data in real-time. At LinkedIn, we developed a fully managed stream processing platform on Flink running on K8s to power hundreds of stream processing pipelines in production. This platform is the backbone for other infra systems like Search, Espresso (internal document store) and feature management etc. We provide a rich authoring and testing environment which allows users to create, test, and deploy their streaming jobs in a self-serve fashion within minutes. Users can focus on their business logic, leaving the Flink platform to take care of management aspects such as split deployment, resource provisioning, auto-scaling, job monitoring, alerting, failure recovery and much more. In this talk, we will introduce the overall platform architecture, highlight the unique value propositions that it brings to stream processing at LinkedIn and share the experiences and lessons we have learned.
ksqlDB is a stream processing SQL engine that allows stream processing on top of Apache Kafka. ksqlDB is based on Kafka Streams and provides capabilities for consuming messages from Kafka, analysing these messages in near-realtime with a SQL-like language, and producing results back to a Kafka topic. As a result, not a single line of Java code has to be written and you can reuse your SQL know-how. This lowers the bar for starting with stream processing significantly.
ksqlDB offers powerful stream processing capabilities, such as joins, aggregations, time windows and support for event time. In this talk I will present how KSQL integrates with the Kafka ecosystem and demonstrate how easy it is to implement a solution using ksqlDB for the most part. This will be done in a live demo on a fictitious IoT sample.
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
Flink Forward San Francisco 2022.
Probably everyone who has written stateful Apache Flink applications has used one of the fault-tolerant keyed state primitives ValueState, ListState, and MapState. With RocksDB, however, retrieving and updating items comes at an increased cost that you should be aware of. Sometimes, these costs may not be avoidable with the current API, e.g., for efficient event-time stream-sorting or streaming joins where you need to iterate one or two buffered streams in the right order. With FLIP-220, we are introducing a new state primitive: BinarySortedMultiMapState. This new form of state lets you (a) efficiently store lists of values for a user-provided key, and (b) iterate keyed state in a well-defined sort order. Both features can be backed efficiently by RocksDB with a 2x performance improvement over the current workarounds. This talk will go into the details of the new API and its implementation, present how to use it in your application, and talk about the process of getting it into Flink.
by
Nico Kruber
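To make the two properties named in the abstract concrete (lists of values per key, and iteration in key order), here is a minimal in-memory sketch. It is not the FLIP-220 API and has nothing to do with RocksDB: the `SortedMultiMap` class and its method names are hypothetical, and it only mimics the behaviour for an event-time sorting use case.

```python
import bisect
from collections import defaultdict


class SortedMultiMap:
    """Keeps a list of values per key and iterates keys in ascending order."""

    def __init__(self):
        self._keys = []                    # distinct keys, kept sorted
        self._values = defaultdict(list)   # key -> values in insertion order

    def add(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key].append(value)

    def items_until(self, limit):
        """Yield (key, values) in ascending key order for all keys <= limit."""
        end = bisect.bisect_right(self._keys, limit)
        for key in self._keys[:end]:
            yield key, list(self._values[key])

    def remove_until(self, limit):
        """Drop all keys <= limit, e.g. after they have been emitted."""
        end = bisect.bisect_right(self._keys, limit)
        for key in self._keys[:end]:
            del self._values[key]
        del self._keys[:end]


# Event-time sorting: buffer out-of-order events by timestamp, then emit
# everything up to a watermark of 20 in timestamp order.
buffer = SortedMultiMap()
for timestamp, event in [(30, "c"), (10, "a"), (10, "b"), (20, "x")]:
    buffer.add(timestamp, event)
emitted = list(buffer.items_until(20))
buffer.remove_until(20)
remaining = list(buffer.items_until(100))
```

Here the timestamp plays the role of the user-provided key, and the watermark value is an arbitrary number chosen for the example.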
Where is my bottleneck? Performance troubleshooting in FlinkFlink Forward
Flink Forward San Francisco 2022.
In this talk, we will cover various topics around performance issues that can arise when running a Flink job and how to troubleshoot them. We’ll start with the basics, like understanding what the job is doing and what backpressure is. Next, we will see how to identify bottlenecks and which tools or metrics can be helpful in the process. Finally, we will also discuss potential performance issues during the checkpointing or recovery process, as well as some tips and Flink features that can speed up checkpointing and recovery times.
by
Piotr Nowojski
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021StreamNative
You may be familiar with the Presto plugin, which runs fast interactive queries over Pulsar using ANSI SQL and lets Pulsar data be joined with other data sources. This plugin will soon be renamed to align with the rename of the PrestoSQL project to Trino. What is the purpose of this rename and what does it mean for those using the Presto plugin? We cover the history of the community shift from PrestoDB to PrestoSQL, as well as the future plans for the Pulsar community to donate this plugin to the Trino project. One of the connector maintainers will then demo the connector and show what is possible when using Trino and Pulsar!
ksqlDB: A Stream-Relational Database Systemconfluent
Speaker: Matthias J. Sax, Software Engineer, Confluent
ksqlDB is a distributed event streaming database system that allows users to express SQL queries over relational tables and event streams. The project was released by Confluent in 2017 and is hosted on GitHub and developed with an open-source spirit. ksqlDB is built on top of Apache Kafka®, a distributed event streaming platform. In this talk, we discuss ksqlDB’s architecture, which is influenced by Apache Kafka and its stream processing library, Kafka Streams. We explain how ksqlDB executes continuous queries while achieving fault tolerance and high availability. Furthermore, we explore ksqlDB’s streaming SQL dialect and the different types of supported queries.
Matthias J. Sax is a software engineer at Confluent working on ksqlDB. He mainly contributes to Kafka Streams, Apache Kafka's stream processing library, which serves as ksqlDB's execution engine. Furthermore, he helps evolve ksqlDB's "streaming SQL" language. In the past, Matthias also contributed to Apache Flink and Apache Storm and he is an Apache committer and PMC member. Matthias holds a Ph.D. from Humboldt University of Berlin, where he studied distributed data stream processing systems.
https://db.cs.cmu.edu/events/quarantine-db-talk-2020-confluent-ksqldb-a-stream-relational-database-system/
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheDremio Corporation
From DataEngConf 2017 - Everybody wants to get to data faster. As we move from more general solutions to specific optimization techniques, the level of performance impact grows. This talk will discuss how layering in-memory caching, columnar storage and relational caching can combine to provide a substantial improvement in overall data science and analytical workloads. It will include a detailed overview of how you can use Apache Arrow, Calcite and Parquet to improve performance by multiple orders of magnitude over what is currently possible.
The Parquet Format and Performance Optimization OpportunitiesDatabricks
The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads.
As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and Parquet in general.
This talk serves both as an approachable refresher on columnar storage as well as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
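Of the optimizations listed above, min/max skipping is easy to show in miniature. The sketch below does not use any Parquet library: `RowGroup`, `make_row_group`, and `scan_greater_than` are hypothetical names, and the "row groups" are plain Python lists that carry the kind of min/max statistics the abstract refers to.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    min_value: int
    max_value: int
    rows: list


def make_row_group(rows):
    """Record min/max statistics at write time, next to the data."""
    return RowGroup(min(rows), max(rows), list(rows))


def scan_greater_than(row_groups, threshold):
    """Evaluate `value > threshold`, skipping groups whose stats rule them out."""
    matches, groups_read = [], 0
    for group in row_groups:
        if group.max_value <= threshold:
            continue  # no row in this group can satisfy the predicate
        groups_read += 1
        matches.extend(v for v in group.rows if v > threshold)
    return matches, groups_read


groups = [
    make_row_group(range(1, 11)),    # values 1..10
    make_row_group(range(11, 21)),   # values 11..20
    make_row_group(range(21, 31)),   # values 21..30
]
matches, groups_read = scan_greater_than(groups, 18)
```

Only two of the three groups are read for this predicate. The skipping is effective here because the data is sorted across groups; with randomly distributed values every group's range would overlap the predicate and nothing could be skipped.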
Apache Flink is a popular stream computing framework for real-time stream computing. Many stream compute algorithms require trailing data in order to compute the intended result. One example is computing the number of user logins in the last 7 days. This creates a dilemma where the results of the stream program are incomplete until the runtime of the program exceeds 7 days. The alternative is to bootstrap the program using historic data to seed the state before shifting to use real-time data.
This talk will discuss alternatives to bootstrap programs in Flink. Some alternatives rely on technologies exogenous to the stream program, such as enhancements to the pub/sub layer, that are more generally applicable to other stream compute engines. Other alternatives include enhancements to Flink source implementations. Lyft is exploring another alternative using orchestration of multiple Flink programs. The talk will cover why Lyft pursued this alternative and future directions to further enhance bootstrapping support in Flink.
Speaker
Gregory Fee, Principal Engineer, Lyft
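The 7-day login example from the abstract above can be sketched as follows. This is not Lyft's approach or Flink code: `TrailingLoginCounter` and `bootstrap` are hypothetical names, the counter tracks a single user, and timestamps are assumed to be in seconds and to arrive in order.

```python
from collections import deque

WINDOW_SECONDS = 7 * 24 * 3600
DAY = 24 * 3600


class TrailingLoginCounter:
    """Counts logins in the trailing 7 days for one user."""

    def __init__(self):
        self._timestamps = deque()

    def add(self, timestamp):
        self._timestamps.append(timestamp)

    def count(self, now):
        # Evict logins that are 7 days old or older relative to `now`.
        while self._timestamps and self._timestamps[0] <= now - WINDOW_SECONDS:
            self._timestamps.popleft()
        return len(self._timestamps)


def bootstrap(counter, historic_timestamps):
    """Seed the state from historic data before switching to live events."""
    for timestamp in sorted(historic_timestamps):
        counter.add(timestamp)


historic = [1 * DAY, 3 * DAY, 5 * DAY]

cold = TrailingLoginCounter()          # started without history
cold.add(8 * DAY)
cold_count = cold.count(now=8 * DAY)

warm = TrailingLoginCounter()          # seeded from history first
bootstrap(warm, historic)
warm.add(8 * DAY)
warm_count = warm.count(now=8 * DAY)
```

The cold counter reports one login, while the bootstrapped counter reports three, which is the gap the abstract describes for a program that has been running for less than 7 days.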
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...confluent
RocksDB is the default state store for Kafka Streams. In this talk, we will discuss how to improve single node performance of the state store by tuning RocksDB and how to efficiently identify issues in the setup. We start with a short description of the RocksDB architecture. We discuss how Kafka Streams restores the state stores from Kafka by leveraging RocksDB features for bulk loading of data. We give examples of hand-tuning the RocksDB state stores based on Kafka Streams metrics and RocksDB’s metrics. At the end, we dive into a few RocksDB command line utilities that allow you to debug your setup and dump data from a state store. We illustrate the usage of the utilities with a few real-life use cases. The key takeaway from the session is the ability to understand the internal details of the default state store in Kafka Streams so that engineers can fine-tune their performance for different varieties of workloads and operate the state stores in a more robust manner.
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureKai Wähner
Apache Kafka in conjunction with Apache Spark became the de facto standard for processing and analyzing data. Both frameworks are open, flexible, and scalable.
Unfortunately, the latter makes operations a challenge for many teams. Ideally, teams can use serverless SaaS offerings to focus on business logic. However, hybrid and multi-cloud scenarios require a cloud-native platform that provides automated and elastic tooling to reduce the operations burden.
This session explores different architectures to build serverless Apache Kafka and Apache Spark multi-cloud architectures across regions and continents.
We start from the analytics perspective of a data lake and explore its relation to a fully integrated data streaming layer with Kafka to build a modern Data Lakehouse.
Real-world use cases show the joint value and explore the benefit of the "delta lake" integration.
Real-time Analytics with Trino and Apache PinotXiang Fu
Trino summit 2021:
Overview of Trino Pinot Connector, which bridges the flexibility of Trino's full SQL support to the power of Apache Pinot's realtime analytics, giving you the best of both worlds.
The Art of The Event Streaming Application: Streams, Stream Processors and Sc...confluent
Have you ever imagined what it would be like to build a massively scalable streaming application on Kafka, the challenges, the patterns and the thought process involved? How much of the application can be reused? What patterns will you discover? How does it all fit together? Depending upon your use case and business, this can mean many things. Starting out with a data pipeline is one thing, but evolving into a company-wide real-time application that is business critical and entirely dependent upon a streaming platform is a giant leap. Large-scale streaming applications are also called event streaming applications. They are classically different from other data systems; event streaming applications are viewed as a series of interconnected streams that are topologically defined using stream processors; they hold state that models your use case as events. Almost like a deconstructed real-time database.
In this talk, I step through the origins of event streaming systems, understanding how they are developed from raw events to evolve into something that can be adopted at an organizational scale. I start with event-first thinking, Domain Driven Design to build data models that work with the fundamentals of Streams, Kafka Streams, KSQL and Serverless (FaaS).
Building upon this, I explain how to build common business functionality by stepping through the patterns for:
– Scalable payment processing
– Run it on rails: Instrumentation and monitoring
– Control flow patterns
Finally, all of these concepts are combined in a solution architecture that can be used at an enterprise scale. I will introduce enterprise patterns such as events-as-a-backbone, events as APIs and methods for governance and self-service. You will leave the talk with an understanding of how to model events with event-first thinking, how to work towards reusable streaming patterns and, most importantly, how it all fits together at scale.
Kafka Summit London 2019 - the art of the event-streaming appNeil Avery
Have you ever imagined what it would be like to build a massively scalable streaming application on Kafka, the challenges, the patterns and the thought process involved? How much of the application can be reused? What patterns will you discover? How does it all fit together? Depending upon your use case and business, this can mean many things. Starting out with a data pipeline is one thing, but evolving into a company-wide real-time application that is business critical and entirely dependent upon a streaming platform is a giant leap. Large-scale streaming applications are also called event streaming applications. They are classically different from other data systems; event streaming applications are viewed as a series of interconnected streams that are topologically defined using stream processors; they hold state that models your use case as events. Almost like a deconstructed real-time database.
In this talk, I step through the origins of event streaming systems, understanding how they are developed from raw events to evolve into something that can be adopted at an organizational scale. I start with event-first thinking, Domain Driven Design to build data models that work with the fundamentals of Streams, Kafka Streams, KSQL and Serverless (FaaS).
Building upon this, I explain how to build common business functionality by stepping through the patterns for:
– Scalable payment processing
– Run it on rails: Instrumentation and monitoring
– Control flow patterns
Finally, all of these concepts are combined in a solution architecture that can be used at an enterprise scale. I will introduce enterprise patterns such as events-as-a-backbone, events as APIs and methods for governance and self-service. You will leave the talk with an understanding of how to model events with event-first thinking, how to work towards reusable streaming patterns and, most importantly, how it all fits together at scale.
How to Quantify the Value of Kafka in Your Organization confluent
(Lyndon Hedderly, Confluent) Kafka Summit SF 2018
We all know real-time data has value. But how do you quantify that value in order to create a business case for becoming more data-driven or event-driven?
The first half of this talk will explore the value of data across a variety of organizations, starting with the five most valuable companies in the world: Apple, Alphabet (Google), Microsoft, Amazon and Facebook (based on stock prices July 2017). We will go on to discuss other digital natives: Uber, eBay, Netflix and LinkedIn, before exploring more traditional companies across retail, finance and automotive. Next, we’ll look at non-businesses such as governments and lobbyists. Whether organizations are using data to create new business products and services, improve user experiences, increase productivity, manage risk or influence global power, we’ll see that fast and interconnected data, or “event streaming”, is increasingly important.
After showing that data value can be quantified, the second half of this talk will explain the five steps to creating a business case.
Most businesses focus on:
-Making more money or conferring competitive advantage to make more money
-Increasing efficiency to save money and/or
-Mitigating risk to the business to protect money
We’ll walk through examples of real business cases, discuss how business cases have evolved over the years and show the power of a sound business case. If you’re interested in big money and big business, as well as big data, this talk is for you.
Jet Reports is the tool for building the best BI, faster CLARA CAMPROVIN
Business analytics when you need them, anywhere
Jet Enterprise is a business intelligence and reporting solution developed specifically to meet the needs of Microsoft Dynamics users. You can now bring all your information together in one place and let anyone you choose in the organization easily perform sophisticated business analysis from anywhere. Empower users to make better decisions, faster, on virtually any device.
With Jet Enterprise you get:
A complete business intelligence and reporting solution, ready to use in just 2 hours
More than 80 dashboards and report templates
7 customizable prebuilt cubes
A data warehouse
Direct integration with your Microsoft Dynamics data and the ability to connect to other relevant business systems
The ability to create dashboards in minutes, with no need to know the underlying data structure
Optional Jet Mobile, to access your data from anywhere through a web browser or a mobile device
A robust platform for data warehouse automation and customization
“We started with data from Sage Pro, data from NAV 2009 and, in addition, data brought in from the new company we had acquired, so we are now using three data systems. The benefits of combining the three systems in Jet Enterprise have been enormous.”
– Davis & Shirtliff
Immediate success = fast ROI and low cost of ownership
Many business intelligence solutions carry hidden costs, such as long and difficult implementations, expensive customizations, and high license prices when scaled to a large number of users. Jet Enterprise is typically installed in about two hours, requires minimal user training, and offers licenses for an unlimited number of users. Users typically see an increase in gross revenue within the first 12 months of use.
Watch full webinar here: https://buff.ly/2mHGaLA
What started to evolve as the most agile and real-time enterprise data fabric, data virtualization is proving to go beyond its initial promise and is becoming one of the most important enterprise big data fabrics.
Attend this session to learn:
• What data virtualization really is
• How it differs from other enterprise data integration technologies
• Why data virtualization is finding enterprise-wide deployment inside some of the largest organizations
The digital revolution is disrupting businesses like never before! The ability to extract actionable insight from a large amount of disparate data has become the determining factor of competitive advantage! Every day, new business models are created around data, forcing incumbents to reinvent themselves to stay relevant. Consumer-facing businesses felt this pressure early on, but eventually every business needs to be data driven. But what is the best strategy to address this digital disruption? Our experience says that modernizing the core data infrastructure is the logical starting point! In this session, we will share trends, strategies and our experience in rejuvenating the data integration landscape to address digital disruption.
Kafka summit SF 2019 - the art of the event-streaming appNeil Avery
Have you ever imagined what it would be like to build a massively scalable streaming application on Kafka, the challenges, the patterns and the thought process involved? How much of the application can be reused? What patterns will you discover? How does it all fit together? Depending upon your use case and business, this can mean many things. Starting out with a data pipeline is one thing, but evolving into a company-wide real-time application that is business critical and entirely dependent upon a streaming platform is a giant leap. Large-scale streaming applications are also called event streaming applications. They are classically different from other data systems; event streaming applications are viewed as a series of interconnected streams that are topologically defined using stream processors; they hold state that models your use case as events. Almost like a deconstructed realtime database.
In this talk, I step through the origins of event streaming systems, understanding how they are developed from raw events to evolve into something that can be adopted at an organizational scale. I start with event-first thinking, Domain Driven Design to build data models that work with the fundamentals of Streams, Kafka Streams, KSQL and Serverless (FaaS). Building upon this, I explain how to build common business functionality by stepping through patterns for:
– Scalable payment processing
– Run it on rails: Instrumentation and monitoring
– Control flow patterns (start, stop, pause)
Finally, all of these concepts are combined in a solution architecture that can be used at enterprise scale. I will introduce enterprise patterns such as events-as-a-backbone, events as APIs and methods for governance and self-service. You will leave the talk with an understanding of how to model events with event-first thinking, how to work towards reusable streaming patterns and, most importantly, how it all fits together at scale.
The art of the event streaming application: streams, stream processors and sc...confluent
Have you ever imagined what it would be like to build a massively scalable streaming application on Kafka, the challenges, the patterns and the thought process involved? How much of the application can be reused? What patterns will you discover? How does it all fit together? Depending upon your use case and business, this can mean many things. Starting out with a data pipeline is one thing, but evolving into a company-wide real-time application that is business critical and entirely dependent upon a streaming platform is a giant leap. Large-scale streaming applications are also called event streaming applications. They are classically different from other data systems; event streaming applications are viewed as a series of interconnected streams that are topologically defined using stream processors; they hold state that models your use case as events. Almost like a deconstructed realtime database.
In this talk I step through the origins of event streaming systems, understanding how they are developed from raw events to evolve into something that can be adopted at an organizational scale. I start with event-first thinking, Domain Driven Design to build data models that work with the fundamentals of Streams, Kafka Streams, KSQL and Serverless (FaaS). Building upon this, I explain how to build common business functionality by stepping through patterns for:
– Scalable payment processing
– Run it on rails: Instrumentation and monitoring
– Control flow patterns (start, stop, pause)
Finally, all of these concepts are combined in a solution architecture that can be used at enterprise scale. I will introduce enterprise patterns such as events-as-a-backbone, events as APIs and methods for governance and self-service. You will leave the talk with an understanding of how to model events with event-first thinking, how to work towards reusable streaming patterns and, most importantly, how it all fits together at scale.
“Lights Out”Configuration using Tivoli Netcool AutoDiscovery ToolsAntonio Rolle
Review why a CMDB is essential to and is the foundation of your BSM strategy
Outline the known challenges that require planning at the outset of a CMDB initiative
Drill down into the approach and lessons learned in the initial stages of a CMDB rollout for one of the largest financial institutions in North America
Event-driven architectures have been around for a long time, but new trends and innovations in "serverless" computing, data streaming, and Agile practices have created the ground for an evolutionary step that will have significant impact on the way we design and build software over the next decade or more. Much like APIs drove a revolution in public services for RPC, REST, and similar "pull" use cases across organization boundaries, the market now promises to similarly define standard mechanisms to enable "push" notifications of discrete data and activities. This practice, which we call Flow, will drive a revolution in interconnectivity similar to what we saw with HTML and REST. Agile is central to the success of these mechanisms, and is one of the key reasons why this will happen sooner rather than later. The ability to adapt quickly to customer needs, combined with the ability to react quickly to new and changing event sources, is required to make event-driven practices work. In this presentation, James Urquhart describes the changes on our horizon, discusses existing architectures, mechanisms and organizations that are leading the way, and talks specifically about how Agile teams are well prepared to both drive and benefit from Flow systems. The presentation is targeted at technology, development, and product leaders who wish to understand how Flow fits into their architecture portfolio.
EDA Meets Data Engineering – What's the Big Deal?confluent
Presenter: Guru Sattanathan, Systems Engineer, Confluent
Event-driven architectures have been around for many years, much like Apache Kafka®, which was first open sourced in 2011. The reality is that the true potential of Kafka is only being realised now. Kafka is becoming the central nervous system of many of today’s enterprises. It is bringing a profound paradigm shift to the way we think about enterprise IT. What has changed in Kafka to enable this paradigm shift? Is it not just a message broker, and how are enterprises using it today? This session will explore these key questions.
Sydney: https://content.deloitte.com.au/20200221-tel-event-tech-community-syd-registration
Melbourne: https://content.deloitte.com.au/20200221-tel-event-tech-community-mel-registration
Watch here: https://bit.ly/2D1fqB6
Today’s evolving data landscape has spawned new business challenges that require innovative solutions. These challenges include:
- Strategic decision-making, which relies on multiple perspectives such as social and economic factors that require combining internal and external data.
- Accounting for the increased volume and structural complexity of today’s data, and increased frequency required in delivering data assets.
- Coping with data silos that house data that must be combined and provisioned to support decision-making.
- Exposing purpose-built analytics, such as supply chain, for consumption in order to expedite decision-making.
Attend this session to learn how Data as a Service, fueled by data virtualization, overcomes these common challenges from the three dimensions of:
- Provisioning information-rich external data assets,
- Connecting data silos, and
- Enabling pre-built and packaged analytics.
The Streaming Assessment – An Introductionconfluent
Business breakout during Confluent’s streaming event in Munich, presented by Lyndon Hedderly, Director of Customer Solutions at Confluent. This three-day hands-on course focused on how to build, manage, and monitor clusters using industry best-practices developed by the world’s foremost Apache Kafka™ experts. The sessions focused on how Kafka and the Confluent Platform work, how their main subsystems interact, and how to set up, manage, monitor, and tune your cluster.
You got Office 365 and you need to use it. You need to let others in your organization use it because you were designated the admin but you can't spell Office 365... yet. Where do you start? In this session, you will learn about the basics of administering Office 365 through the Admin Centers.
1. Be introduced to Office 365
2. Learn how to set up and manage users
3. Learn how to add your organization's domain
4. Learn about the other 3 major admin centers inside the Office 365 Admin Center: Exchange, Skype for Business, and SharePoint
Similar to Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Capture (20)
Introducing the Apache Flink Kubernetes OperatorFlink Forward
Flink Forward San Francisco 2022.
The Apache Flink Kubernetes Operator provides a consistent approach to manage Flink applications automatically, without any human interaction, by extending the Kubernetes API. Given the increasing adoption of Kubernetes-based Flink deployments, the community has been working on a Kubernetes-native solution as part of Flink that can benefit from the rich experience of community members and ultimately make Flink easier to adopt. In this talk we give a technical introduction to the Flink Kubernetes Operator and demonstrate the core features and use cases through in-depth examples.
by
Thomas Weise
One sink to rule them all: Introducing the new Async SinkFlink Forward
Flink Forward San Francisco 2022.
Next time you want to integrate with a new destination for a demo, concept or production application, the Async Sink framework will bootstrap development, allowing you to move quickly without compromise. In Flink 1.15 we introduced the Async Sink base (FLIP-171), with the goal to encapsulate common logic and allow developers to focus on the key integration code. The new framework handles things like request batching, buffering records, applying backpressure, retry strategies, and at least once semantics. It allows you to focus on your business logic, rather than spending time integrating with your downstream consumers. During the session we will dive deep into the internals to uncover how it works, why it was designed this way, and how to use it. We will code up a new sink from scratch and demonstrate how to quickly push data to a destination. At the end of this talk you will be ready to start implementing your own Flink sink using the new Async Sink framework.
by
Steffen Hausmann & Danny Cranmer
Tuning Apache Kafka Connectors for Flink.pptxFlink Forward
Flink Forward San Francisco 2022.
In normal situations, the default Kafka consumer and producer configuration options work well. But we all know life is not all roses and rainbows, and in this session we’ll explore a few knobs that can save the day in atypical scenarios. First, we'll take a detailed look at the parameters available when reading from Kafka. We’ll inspect the params that help us quickly spot an application lock or crash, the ones that can significantly improve performance, and the ones to touch with gloves since they could cause more harm than benefit. Moreover, we’ll explore the partitioning options and discuss when diverging from the default strategy is needed. Next, we’ll discuss the Kafka Sink. After browsing the available options, we'll dive deep into understanding how to approach use cases like sinking enormous records, managing spikes, and handling small but frequent updates. If you want to understand how to make your application survive when the sky is dark, this session is for you!
by
Olena Babenko
Flink Forward San Francisco 2022.
This talk will take you on the long journey of Apache Flink into the cloud-native era. It started all the way from where Hadoop and YARN were the standard way of deploying and operating data applications.
We're going to deep dive into the cloud-native set of principles and how they map to the Apache Flink internals and recent improvements. We'll cover fast checkpointing, fault tolerance, resource elasticity, minimal infrastructure dependencies, industry-standard tooling, ease of deployment and declarative APIs.
After this talk you'll get a broader understanding of the operational requirements for a modern streaming application and where the current limits are.
by
David Moravek
Using the New Apache Flink Kubernetes Operator in a Production DeploymentFlink Forward
Flink Forward San Francisco 2022.
Running natively on Kubernetes, using the new Apache Flink Kubernetes Operator is a great way to deploy and manage Flink application and session deployments. In this presentation, we provide:
- A brief overview of Kubernetes operators and their benefits.
- An introduction to the five levels of the operator maturity model.
- An introduction to the newly released Apache Flink Kubernetes Operator and FlinkDeployment CRs.
- Dockerfile modifications you can make to swap out UBI images and Java of the underlying Flink Operator container.
- Enhancements we're making in:
  - Versioning/Upgradeability/Stability
  - Security
- A demo of the Apache Flink Operator in action, with a technical preview of an upcoming product using the Flink Kubernetes Operator.
- Lessons learned.
- Q&A.
by
James Busche & Ted Chang
Flink Forward San Francisco 2022.
The Table API is one of the most actively developed components of Flink in recent time. Inspired by databases and SQL, it encapsulates concepts many developers are familiar with. It can be used with both bounded and unbounded streams in a unified way. But from afar it can be difficult to keep track of what this API is capable of and how it relates to Flink's other APIs. In this talk, we will explore the current state of Table API. We will show how it can be used as a batch processor, a changelog processor, or a streaming ETL tool with many built-in functions and operators for deduplicating, joining, and aggregating data. By comparing it to the DataStream API we will highlight differences and elaborate on when to use which API. We will demonstrate hybrid pipelines in which both APIs interact with one another and contribute their unique strengths. Finally, we will take a look at some of the most recent additions as a first step to stateful upgrades.
by
David Andreson
Flink Forward San Francisco 2022.
Based on the new Flink-Pulsar connector, we implemented Flink's TableAPI and Catalog to help users to interact with the Pulsar cluster via Flink SQL easily. We would like to go through the design and implementation of the SQL connector in the following aspects:
1. Two different modes of using Pulsar as a metadata store
2. Data format transformation and management
3. SQL semantics support within Pulsar context
by
Sijie Guo & Neng Lu
Dynamic Rule-based Real-time Market Data AlertsFlink Forward
Flink Forward San Francisco 2022.
At Bloomberg, we deal with high volumes of real-time market data. Our clients expect to be notified of any anomalies in this market data, which may indicate volatile movements in the markets, notable trades, forthcoming events, or system failures. The parameters for these alerts are always evolving and our clients can update them dynamically. In this talk, we'll cover how we utilized the open source Apache Flink and Siddhi SQL projects to build a distributed, scalable, low-latency and dynamic rule-based, real-time alerting system to solve our clients' needs. We'll also cover the lessons we learned along our journey.
by
Ajay Vyasapeetam & Madhuri Jain
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
Flink Forward San Francisco 2022.
In modern data platform architectures, stream processing engines such as Apache Flink are used to ingest continuous streams of data into data lakes such as Apache Iceberg. Streaming ingestion to Iceberg tables can suffer from two problems: (1) the small files problem, which can hurt read performance, and (2) poor data clustering, which can make file pruning less effective. To address those two problems, we propose adding a shuffling stage to the Flink Iceberg streaming writer. The shuffling stage can intelligently group data via bin packing or range partition. This can reduce the number of concurrent files that every task writes. It can also improve data clustering. In this talk, we will explain the motivations in detail and dive into the design of the shuffling stage. We will also share the evaluation results that demonstrate the effectiveness of smart shuffling.
by
Gang Ye & Steven Wu
Flink Forward San Francisco 2022.
At Flink Forward, we get to hear creative, unique use cases, often on the bleeding edge of some of the most exciting current technologies. This talk will give you a chance to open up the hood on our driven and innovative Open Source community. I will cover what our community has been working on this past year, and how this work relates to our (Ververica's) exciting new Flink engineering roadmap! I will also go through some best practices and upcoming opportunities for getting involved in this community!
by
Caito Scherr
Practical learnings from running thousands of Flink jobsFlink Forward
Flink Forward San Francisco 2022.
Task Managers constantly running out of memory? Flink job keeps restarting from cryptic Akka exceptions? Flink job running but doesn’t seem to be processing any records? We share practical learnings from running thousands of Flink Jobs for different use-cases and take a look at common challenges they have experienced such as out-of-memory errors, timeouts and job stability. We will cover memory tuning, S3 and Akka configurations to address common pitfalls and the approaches that we take on automating health monitoring and management of Flink jobs at scale.
by
Hong Teoh & Usamah Jassat
Extending Flink SQL for stream processing use casesFlink Forward
Flink Forward San Francisco 2022.
Apache Flink is a powerful stream processing platform that enables users to build complex real-time applications. Flink SQL provides a SQL interface that implements standard SQL. While standard SQL provides a perfect interface for batch processing, in a stream processing context it can result in ambiguity and complex syntax. As an example, consider these three types of streams: Append-only stream, Retract stream and Upsert stream. Using standard SQL, we would represent all of these streams as a Table, along with the Table concept in batch processing. Such overloading of concepts can result in ambiguity in SQL statements in a streaming context. In this talk, we will present extensions to Flink SQL that simplify SQL statements in the context of stream processing. We will show how such extensions work in the context of a Flink application using different use cases. These extensions are only syntactic sugar, and users should be able to use Flink SQL as is if they desire.
by
Hojjat Jafarpour
The top 3 challenges running multi-tenant Flink at scaleFlink Forward
Apache Flink is the foundation for Decodable's real-time SaaS data platform. Flink runs critical data processing jobs with strong security requirements. In addition, Decodable has to scale to thousands of tenants, power various use cases, provide an intuitive user experience and maintain cost-efficiency. We've learned a lot of lessons while building and maintaining the platform. In this talk, I'll share the top 3 toughest challenges building and operating this platform with Flink, and how we solved them.
Using Queryable State for Fun and ProfitFlink Forward
Flink Forward San Francisco 2022.
A particular feature in our system relies on a streaming 90-minute trailing window of 1-minute samples - implemented as a lookaside cache - to speed up a particular query, allowing our customers to rapidly see an overview of their estate. Across our entire customer base, there is a substantial amount of data flowing into this cache - ~1,000,000 entries/second, with the entire cache requiring ~600GB of RAM. The current implementation is simplistic but expensive. In this talk I describe a replacement implementation as a stateful streaming Flink application leveraging Queryable State. This Flink application reduces the net cost by ~90%. In this session, the implementation is described in detail, including windowing considerations, a sliding-window state buffer that avoids the sliding window replication penalty, and a comparison of queryable state and Redis queries. The talk concludes with a frank discussion of when this distinctive approach is, and is not, appropriate.
by
Ron Crocker
Changelog Stream Processing with Apache FlinkFlink Forward
Flink Forward San Francisco 2022.
The world is constantly changing. Data is continuously produced and thus should be consumed in a similar fashion by enterprise systems. Only this enables real-time decisions at scale. Message logs such as Apache Kafka can be found in almost every architecture, while databases and other batch systems still provide the foundation. Change Data Capture (CDC) propagates changes downstream. In this talk, we will highlight what it means to be a general data processor and how Flink can act as an integration hub. We present the current state of Flink and how it can power various use cases on both finite and infinite streams. We demonstrate Flink's SQL engine as a changelog processor that is shipped with an ecosystem tailored to process CDC data and maintain materialized views. We will use Kafka as an upsert log, Debezium for connecting to databases, and enrich streams of various sources. Finally, we will combine Flink's Table API with DataStream API for event-driven applications beyond SQL.
by
Timo Walther
Large Scale Real Time Fraudulent Web Behavior DetectionFlink Forward
Flink Forward San Francisco 2022.
Neuro-ID analyzes web behavior at a large scale to determine visitors' intent on web pages, specifically in the online lending industry. When users interact with an online loan application, our software analyzes their behavior to determine if the applicant may be potentially fraudulent. Lenders can then request various scores describing the applicant's intentions in real-time to use to make decisions during the application flow. Flink gives our product the ability to observe behavior in a stateful manner. As an applicant interacts with an online loan application, a Flink application is used to compare earlier actions to later actions. This processing in Flink can determine the applicant's intent throughout the process of the application.
by
Jeff Niemann & Randy Hanak
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
Flink Forward San Francisco 2022.
With a real-time processing engine like Flink and a transactional storage layer like Hudi, it has never been easier to build end-to-end low-latency data platforms connecting sources like Kafka to data lake storage. Come learn how to blend Lakehouse architectural patterns with real-time processing pipelines with Flink and Hudi. We will dive deep on how Flink can leverage the newest features of Hudi like multi-modal indexing that dramatically improves query and write performance, data skipping that reduces the query latency by 10x for large datasets, and many more innovations unique to Flink and Hudi.
by
Ethan Guo & Kyle Weller
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen how developers implement features on the front-end just following the standard rules for a framework and think that this is enough to successfully launch the project, and then the project fails. How to prevent this and what approach to choose? I have launched dozens of complex projects and during the talk we will analyze which approaches have worked for me and which have not.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology pushes into IT, I wondered, as an “infrastructure container Kubernetes guy”, how this fancy AI technology gets managed from an infrastructure operations view. Is it possible to apply our lovely cloud native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and provide you with a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into which approaches I already have working for real.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
DevOps and Testing slides at DASA ConnectKari Kakkonen
Slides by Rik Marselis and me from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. Finally, we had a lovely workshop with the participants, trying to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
2. An API that gets out of your way
It’s so easy, we’ve embedded a bunch of examples right
here. Copy some of these requests into your terminal and
check out what happens.
With wrappers in Ruby, PHP, Python and more, you can
get started in minutes. Learn More ➤
3. As complexity grew…
Then we had a ProblemFactory
Started out with
We had a problem, so we thought to use …
4. As data volume grew…
Database scalability is a complicated topic…
Started out with
Had to make sure it was web scale
Distributed transactions
Change Data Capture
5.
6. Squirreling Away $640 Billion
Flink Forward - San Francisco 2022
Jeff Chao
Staff Engineer / Tech Lead for Change Data Capture Infrastructure at Stripe
How Stripe Leverages Flink for Change Data Capture
7. Agenda
1 CDC at Stripe
2 Aggregating Change Events
3 How it Started, How it Ended
Change Data Capture (CDC) is widely used at Stripe to capture data changes from databases without critically impacting database reliability and scalability. CDC powers many critical financial use cases at Stripe such as the Stripe Dashboard, Stripe Search, Sigma, and Financial Reporting.
From idea to production, things may seem straightforward at first, but the details matter. We detail our journey of how we leveraged Flink for Change Data Capture at Stripe in order to uphold the highest data quality standards. Freshness, Coverage, and Correctness SLOs are paramount to the success of platforms and applications running on top of our CDC infrastructure.
Change Event Streams are ubiquitous across Stripe given the vast number of applications and employees generating datasets worldwide. Change Event Streams are independent from one another which leads to the typical challenges in distributed systems. One of the major use cases revolves around aggregating individual change events of a database transaction to support Stripe’s payments infrastructure.
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Capture
15. Building a Platform
Abstract Away Internals
Make sure that we abstract away database internals such as sharding topology and ensure a datastore-agnostic transport.
Interoperable
Build a highly leveraged platform which makes working with Change Events interoperable with other systems within the organization.
Operational Excellence
Minimal toil as we scale the number of datasets, ensure clean separation between infrastructure and user issues, create great operator experiences, reduce control plane and data plane blast radius, maintain good operator tooling, developer experience, and processes.
CDC at Stripe
16. Agenda
1 CDC at Stripe
2 Aggregating Change Events
3 How it Started, How it Ended
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Capture
17. Why?
Product teams working with payments data use transactions
Arbitrary number of tables in a database transaction
They should be able to get transactions back out from the CDC path
They shouldn’t have to become stream processing experts
Aggregating Change Events
34. What is an Aggregated Change Event?
{
  "ts_utc": 1659375300000,
  "data": [
    {
      "operation": "CREATE",
      "transaction": { "id": "txn1" },
      "before": null,
      "after": { ... }
    },
    {
      "operation": "UPDATE",
      "transaction": { "id": "txn1" },
      "before": { ... },
      "after": { ... }
    }
  ]
}
● One transaction with two events having the same transaction ID.
● Events may arrive from an arbitrary number of tables.
Aggregating Change Events
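The aggregated shape above can be sketched in plain Scala, without any Flink dependency. This is only an illustration under stated assumptions: the names `ChangeEvent`, `AggregatedChangeEvent`, and `aggregate` are hypothetical and do not come from the talk, and row images are simplified to string maps.

```scala
object AggregatedChangeEventSketch {
  // Hypothetical type mirroring the per-event fields shown on the slide.
  final case class ChangeEvent(
      operation: String,                    // e.g. "CREATE" or "UPDATE"
      transactionId: String,                // value of transaction.id
      before: Option[Map[String, String]],  // None stands in for JSON null
      after: Option[Map[String, String]]
  )

  // Hypothetical type mirroring the outer object: a timestamp plus the grouped events.
  final case class AggregatedChangeEvent(tsUtc: Long, data: List[ChangeEvent])

  // Groups events by transaction ID; arrival order is preserved within each group.
  def aggregate(events: List[ChangeEvent], tsUtc: Long): Map[String, AggregatedChangeEvent] =
    events.groupBy(_.transactionId).map { case (txnId, grouped) =>
      txnId -> AggregatedChangeEvent(tsUtc, grouped)
    }
}
```

A batch `groupBy` is used here only to show the target shape; a streaming job has to decide when a transaction is complete, which is what the following slides address.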
37. Join
Joins elements of the same key within the same window.
● Produces pairwise elements
[Figure: timeline of Change Events (Event 1, Event 2, Event 3) and Transaction Metadata Events (BEGIN, COMMIT, BEGIN, COMMIT). Output pairs: (Event 1, BEGIN), (Event 1, COMMIT), (Event 2, BEGIN), (Event 2, COMMIT), (Event 3, BEGIN), (Event 3, COMMIT).]
38. Union
Unions multiple streams of the same type into a single stream.
● Requires streams of the same type
[Figure: timeline of Change Events (Event 1, Event 2, Event 3) and Transaction Metadata Events (BEGIN, COMMIT, BEGIN, COMMIT). No output; won’t compile because the streams are of different types.]
39. Connect
Combines two streams, potentially of different types.
● Similar to Union
[Figure: timeline of Change Events (Event 1, Event 2, Event 3) and Transaction Metadata Events (BEGIN, COMMIT, BEGIN, COMMIT). Output: the elements of both streams in one connected stream, unpaired.]
40. What Do We Need?
● Support for streams of different types
● Support for flexible stream combination semantics
● Don’t need pairwise outputs
41. Flink Job Definition
val mainStream =
  transactionMetadataEventStream // uid and name omitted.
    .connect(changeEventStream) // Union different types.
44. Either
Wraps an event containing one of two types, either from the left or the right stream.
● Out-of-box
● No concept of keys
[Figure: timeline of Change Events (Event 1, Event 2, Event 3) and Transaction Metadata Events (BEGIN, COMMIT, BEGIN, COMMIT). Each element is wrapped with exactly one side set: Either.left = BEGIN, Either.right = null for a Transaction Metadata Event; Either.left = null, Either.right = Event 1 for a Change Event.]
45. Custom
Wraps an event containing one of two types, either from the left or the right stream, and a common key among both events.
● Small and simple code addition
● Need to extract keys
[Figure: timeline of Change Events (Event 1, Event 2, Event 3) and Transaction Metadata Events (BEGIN, COMMIT, BEGIN, COMMIT). Each element is wrapped with its key and exactly one side set: WrappedEvent.key = txn-1, WrappedEvent.left = BEGIN, WrappedEvent.right = null; WrappedEvent.key = txn-1, WrappedEvent.left = null, WrappedEvent.right = Event 1.]
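A minimal sketch of such a wrapper in plain Scala, without Flink dependencies. The talk only shows that WrappedEvent carries key, left, and right; the event fields, the use of Option instead of null, and the wrapLeft/wrapRight helpers are assumptions made for illustration.

```scala
object WrappedEventSketch {
  // Hypothetical, simplified event types; the real schemas are not shown in the talk.
  final case class TransactionMetadataEvent(id: String, marker: String)
  final case class ChangeEvent(transactionId: String, operation: String)

  // Like Either, but with the transaction ID lifted out as a common key.
  // Option is used here in place of the null fields shown on the slide.
  final case class WrappedEvent(
      key: String,
      left: Option[TransactionMetadataEvent],
      right: Option[ChangeEvent]
  )

  // Wrap an element of the left (transaction metadata) stream.
  def wrapLeft(e: TransactionMetadataEvent): WrappedEvent =
    WrappedEvent(e.id, Some(e), None)

  // Wrap an element of the right (change event) stream.
  def wrapRight(e: ChangeEvent): WrappedEvent =
    WrappedEvent(e.transactionId, None, Some(e))
}
```

Because both sides expose the same key field, a later keyBy(_.key) can group markers and change events of one transaction without knowing which stream an element came from.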
46. What Do We Need?
● Wrap elements of a connected stream
● Be able to identify keys to support aggregations later
47. Flink Job Definition
val mainStream =
  transactionMetadataEventStream // uid and name omitted.
    .connect(changeEventStream) // Union different types.
    .flatMap(new WrappedEventFunction) // Like Either type, but with extra fields.
    .keyBy(_.key) // Group events with the same transaction ID.
49. Aggregation Characteristics
● Arbitrary number of Change Event Streams
● One Transaction Metadata Event Stream
● Change Events must have the same transaction IDs
● Handle late-arriving or duplicate Change Events and Transaction Metadata Events
● Must not result in infinite state growth
51. Tumbling Windows
Assigns elements to windows of a fixed size.
● Windows don’t overlap
[Figure: timeline of Change Events (Event 1, Event 2, Event 3) and Transaction Metadata Events (BEGIN, COMMIT, BEGIN, COMMIT).]
56. Tumbling Windows
[Figure: timeline of Change Events (Event 1, Event 2) and Transaction Metadata Events (BEGIN, COMMIT) against a fixed-size window.]
● Late-arriving events? Add delay.
● Large delay? Trade-off: Freshness vs Correctness.
● Not quite right…
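To make the fixed boundaries concrete, here is a small sketch of tumbling window assignment in plain Scala. It assumes non-negative millisecond timestamps and a window offset of zero; the function name is made up for this example.

```scala
object TumblingWindowSketch {
  // Start of the tumbling window that contains `timestampMs`
  // (non-negative timestamps, window offset 0).
  def windowStart(timestampMs: Long, sizeMs: Long): Long =
    timestampMs - (timestampMs % sizeMs)
}
```

With 10-second windows, a BEGIN marker at 9,900 ms lands in the window starting at 0 while its COMMIT at 10,100 ms lands in the window starting at 10,000, so one transaction can be split across two windows regardless of how close its events are.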
57. Sliding Windows
Assigns elements to windows of a fixed size, but with a slide interval.
● Almost like a tumbling window, but with windows overlapping
[Figure: timeline of Change Events (Event 1, Event 2, Event 3) and Transaction Metadata Events (BEGIN, COMMIT, BEGIN, COMMIT).]
58. Sliding Windows
[Figure: timeline of Change Events (Event 1, Event 2) and Transaction Metadata Events (BEGIN, COMMIT) against overlapping windows.]
● Late-arriving events? Same as tumbling windows.
● Slide interval? Explosion of windows
● Not quite right…
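The "explosion of windows" can be illustrated with a short sketch that lists every sliding window a single element falls into. This is plain Scala under the same assumptions as before (millisecond timestamps, offset zero); the helper name is hypothetical.

```scala
object SlidingWindowSketch {
  // Start timestamps of all sliding windows [start, start + sizeMs)
  // that contain `timestampMs`, newest window first.
  def windowStarts(timestampMs: Long, sizeMs: Long, slideMs: Long): Seq[Long] = {
    val lastStart = timestampMs - (timestampMs % slideMs)
    Iterator
      .iterate(lastStart)(_ - slideMs)
      .takeWhile(start => start > timestampMs - sizeMs)
      .toSeq
  }
}
```

An element at 10,100 ms with a 10-second window and a 1-second slide belongs to 10 windows, so each element is held and evaluated roughly size / slide times.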
59. Session Windows
Groups elements that are seen relatively close to each other.
● Arbitrarily-sized windows; no fixed start and end
● Windows don’t overlap
● Windows close based on a defined gap of inactivity
[Figure: timeline of Change Events (Event 1, Event 2, Event 3) and Transaction Metadata Events (BEGIN, COMMIT, BEGIN, COMMIT).]
66. Session Windows
[Figure: timeline of Change Events (Event 1, Event 2) and Transaction Metadata Events (BEGIN, COMMIT) against session windows.]
● Session gap too small? Incomplete aggregates
● Session gap too big? Trade-off: Freshness vs Correctness
● Not quite right…
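A rough sketch of how the session gap decides whether a transaction stays together, in plain Scala. It assumes the timestamps of one key are already sorted; the function name and the sample timestamps are made up for illustration.

```scala
object SessionWindowSketch {
  // Splits sorted timestamps into sessions: an element joins the current
  // session if it arrives less than `gapMs` after the previous element,
  // otherwise it opens a new session.
  def sessions(sortedTimestampsMs: Seq[Long], gapMs: Long): Vector[Vector[Long]] =
    sortedTimestampsMs.foldLeft(Vector.empty[Vector[Long]]) { (acc, ts) =>
      acc.lastOption match {
        case Some(current) if ts - current.last < gapMs =>
          acc.init :+ (current :+ ts)
        case _ =>
          acc :+ Vector(ts)
      }
    }
}
```

For events at 0, 100, 5,000, and 5,100 ms, a 1-second gap yields two sessions (an incomplete aggregate), while a 10-second gap yields one session that can only close 10 seconds after the last event, which costs freshness.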
67. Global Windows
Assigns elements to a single window.
● Only a single window per key
● Window never closes
[Figure: timeline of Change Events (Event 1, Event 2, Event 3) and Transaction Metadata Events (BEGIN, COMMIT, BEGIN, COMMIT).]
68. Global Windows
● Outputs never get evaluated and materialized
● Needs more…
69. Global Windows + Custom Stateful Trigger
Assign elements to a Global Window and add a custom stateful trigger.
● Flexibly define open/close conditions for non-overlapping windows
● Reasonably handle late-arriving events
● Avoid infinite state growth and reduce likelihood of incomplete aggregates
70. What Makes an Aggregation Complete?
● BEGIN transaction marker seen
● COMMIT transaction marker seen
● All Change Events of the transaction seen
● All Change Events are globally and locally ordered
71. Custom Stateful Trigger: TransactionBoundaryTrigger
if transaction metadata event:
    if begin transaction marker:
        update begin marker state
    else:
        update commit marker state
        update bitmap state using commit marker’s total event count
    set timeout state and register event time timer
else:
    update bitmap state with change event’s global position
    set timeout state and register event time timer
if should trigger(begin, commit, total events):
    clear window
    TriggerResult.FIRE_AND_PURGE
else:
    TriggerResult.CONTINUE
Reference
// ChangeEvent#transaction
{
  "id": "transaction-id",
  "global_position": 1,
  "source_position": 1
}
// TransactionMetadataEvent
{
  "id": "transaction-id",
  "ts_utc": 1659375300000,
  "marker": "COMMIT",
  "total_events": 3,
  "per_source_event_counts": [{ ... }]
}
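The trigger logic can be sketched in plain Scala as a pure state transition. This is not the actual TransactionBoundaryTrigger: the types below are local stand-ins (Continue and FireAndPurge mirror Flink's TriggerResult values), global positions are assumed to be 1-based, and timeout state and event time timers are left out.

```scala
import scala.collection.immutable.BitSet

object TransactionBoundaryTriggerSketch {
  sealed trait Decision
  case object Continue extends Decision     // stands in for TriggerResult.CONTINUE
  case object FireAndPurge extends Decision // stands in for TriggerResult.FIRE_AND_PURGE

  sealed trait Input
  // kind is "BEGIN" or "COMMIT"; totalEvents is only meaningful on COMMIT.
  final case class Marker(kind: String, totalEvents: Int = 0) extends Input
  final case class Change(globalPosition: Int) extends Input

  // Per-key state: begin marker, commit marker's total count, bitmap of seen positions.
  final case class State(
      begin: Boolean = false,
      total: Option[Int] = None,
      seen: BitSet = BitSet.empty
  )

  // Update the per-key state with one element, then decide whether to fire.
  def onElement(state: State, in: Input): (State, Decision) = {
    val next = in match {
      case Marker("BEGIN", _)     => state.copy(begin = true)
      case Marker(_, totalEvents) => state.copy(total = Some(totalEvents))
      case Change(pos)            => state.copy(seen = state.seen + pos)
    }
    // Complete once BEGIN, COMMIT, and positions 1..total have all been seen.
    val complete =
      next.begin && next.total.exists(t => next.seen == BitSet((1 to t): _*))
    if (complete) (State(), FireAndPurge) else (next, Continue)
  }
}
```

Because the bitmap is a set, the arrival order of markers and change events does not matter, and a duplicate change event that arrives before the aggregate completes does not change the outcome.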
72. Flink Job Definition
val mainStream =
  transactionMetadataEventStream // uid and name omitted.
    .connect(changeEventStream) // Union different types.
    .flatMap(new WrappedEventFunction) // Like Either type, but with extra fields.
    .keyBy(_.key) // Group events with the same transaction ID.
    .window(GlobalWindows.create)
    .trigger(new TransactionBoundaryTrigger(...)) // Flexible windowing semantics.
    .process(new KeyedProcessor(...))
74. Flink Job Definition
val mainStream =
  transactionMetadataEventStream // uid and name omitted.
    .connect(changeEventStream) // Union different types.
    .flatMap(new WrappedEventFunction) // Like Either type, but with extra fields.
    .keyBy(_.key) // Group events with the same transaction ID.
    .window(GlobalWindows.create)
    .trigger(new TransactionBoundaryTrigger(...)) // Flexible windowing semantics.
    .process(new KeyedProcessor(...))
mainStream // Side output to DLQ.
  .getSideOutput(...)
  .addSink(...)
mainStream // Output aggregated change events.
  .addSink(...)
75. Agenda
1 CDC at Stripe
2 Aggregating Change Events
3 How it Started, How it Ended
76. From Idea to Production
How it Started, How it Ended
● Coverage
● Platform
● State
80. Observations
● Infinite keys due to continuous stream of new transactions
● Using a Global Window; possible windows not closing properly
● No trigger timeouts firing
● No watermarks being generated
82. Fix
● Fixed an upstream issue where transaction IDs were getting mixed up
● Reduce parallelism on Source Sub Tasks for all streams
● Make sure parallelism ≤ ∑ Topic Partitions
● Generally, check with SplitEnumerator classes
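A small sketch of why source parallelism above the partition count is a problem. The round-robin assignment below is an assumption for illustration only; the real assignment is decided by the connector's split enumerator.

```scala
object SourceParallelismSketch {
  // Hypothetical round-robin assignment of partitions to source sub tasks.
  def assign(partitions: Int, parallelism: Int): Map[Int, Seq[Int]] =
    (0 until parallelism).map { subTask =>
      subTask -> (0 until partitions).filter(_ % parallelism == subTask)
    }.toMap

  // Sub tasks without any partition never see events, so they never
  // advance their watermark.
  def idleSubTasks(partitions: Int, parallelism: Int): Seq[Int] =
    assign(partitions, parallelism).toSeq
      .collect { case (subTask, ps) if ps.isEmpty => subTask }
      .sorted
}
```

With 2 partitions and a parallelism of 4, sub tasks 2 and 3 receive no partitions; keeping parallelism at or below the partition count leaves no sub task without input.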
85. Observations
● State size still growing, but slower
● Event time timers firing, sometimes
● Watermarks are being generated, but not for all sub tasks
86. New Observations
[Figure: Source Sub Tasks reading from charges (partitions = 2), audits (partitions = 1), disputes (partitions = 1), and Transaction Metadata Events; one input is marked as a low volume stream.]
87. Possible Fix
Switch from event time to processing time
● Less precise
● Could cause premature trigger firing, resulting in incomplete aggregates
88. Actual Fix
Add idleness property on sources
● Can still use event time
● More precise
● Not perfect; can still result in incomplete aggregates in edge cases
● That’s the reality of streaming
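The effect of idleness can be sketched as follows: a downstream watermark is the minimum over its inputs, so one quiet input holds everything back unless it is marked idle. This is a plain Scala model, not Flink code; SourceStatus and combined are made-up names. In Flink itself, idleness is typically configured on the watermark strategy (WatermarkStrategy#withIdleness).

```scala
object WatermarkSketch {
  final case class SourceStatus(name: String, watermarkMs: Long, idle: Boolean)

  // Combined watermark: the minimum over active inputs; idle inputs are
  // ignored. Returns None when every input is idle.
  def combined(inputs: Seq[SourceStatus]): Option[Long] = {
    val active = inputs.filterNot(_.idle).map(_.watermarkMs)
    if (active.isEmpty) None else Some(active.min)
  }
}
```

If a low volume stream has emitted nothing yet, its watermark stays at the minimum value and event time timers never fire; once that stream is flagged idle, the combined watermark follows the active streams. The edge case remains that an event arriving on a previously idle stream may already be late.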
92. Observations
● Don’t want to redeploy every time a new dataset (Kafka Topic) is added
● Blows away Freshness SLO’s error budget
● Poor developer onboarding experience
93. Fix
● Instead of Kafka Topic List Subscriber, use Regex Subscriber
● Subscribe to all topics (for a keyspace) by default
● Control plane (external) service produces an event to Broadcast Stream
● On broadcast element, use Broadcast State to keep onboarded datasets in state
● On element, check Broadcast State and filter for onboarded datasets
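The two halves of this fix, subscribing by pattern and filtering by onboarded datasets, can be modeled in plain Scala. The topic naming scheme and all names below are hypothetical; Onboarded stands in for Flink's Broadcast State, and its two methods correspond loosely to processBroadcastElement and processElement.

```scala
import scala.util.matching.Regex

object DatasetFilterSketch {
  // Hypothetical naming scheme: every topic of a keyspace shares a prefix.
  val topicPattern: Regex = "cdc\\.payments\\..+".r

  // True if the regex subscription would pick up this topic.
  def subscribed(topic: String): Boolean =
    topicPattern.pattern.matcher(topic).matches()

  // Stand-in for Broadcast State: the set of onboarded datasets (topics).
  final case class Onboarded(topics: Set[String] = Set.empty) {
    // Analogous to processBroadcastElement: the control plane onboards a dataset.
    def onBroadcast(topic: String): Onboarded = copy(topics = topics + topic)
    // Analogous to processElement: keep only events from onboarded datasets.
    def accepts(topic: String): Boolean = topics.contains(topic)
  }
}
```

New topics are consumed as soon as they match the pattern, but their events are dropped until the control plane broadcasts the onboarding event, so adding a dataset needs no redeploy.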
97. Observations
● Incomplete aggregates still happening, but not frequently
● Kafka by default is at-least-once delivery
● Many independent streams operating at different speeds
98. Fix
● Move incomplete aggregate measurement out of the Flink Job and into a system downstream
● New system needs to dedupe events… for all time?
● Storage will be expensive. Trade-off between confidence and cost-efficiency: KV store or bloom filter
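To illustrate the bloom filter side of that trade-off, here is a toy filter in plain Scala. It is a sketch, not what Stripe runs: the sizing, the hash scheme (MurmurHash3 with different seeds), and the key format are assumptions. A KV store would give exact answers at a higher storage cost.

```scala
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

object DedupeSketch {
  // A tiny Bloom filter: no false negatives, but false positives are possible.
  final class BloomFilter(bits: Int, hashes: Int) {
    private val set = new mutable.BitSet(bits)

    // One bit index per seeded hash of the key.
    private def indexes(key: String): Seq[Int] =
      (0 until hashes).map(seed => Math.floorMod(MurmurHash3.stringHash(key, seed), bits))

    def add(key: String): Unit = indexes(key).foreach(set += _)

    def mightContain(key: String): Boolean = indexes(key).forall(set.contains)
  }

  // Returns true the first time a key is seen; false means "probably a duplicate".
  def firstSeen(filter: BloomFilter, key: String): Boolean = {
    val isNew = !filter.mightContain(key)
    filter.add(key)
    isNew
  }
}
```

A false positive here means a genuinely new event is treated as a duplicate, which is the loss of confidence being traded for lower storage cost.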
101. Agenda
1 CDC at Stripe
2 Aggregating Change Events
3 How it Started, How it Ended
102. Wrap Up
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Capture
● Change Data Capture (CDC) is widely used at Stripe to improve database reliability and scalability
● Flink is a critical component in Stripe’s CDC infrastructure that allows us to work with financial streaming data with high data quality guarantees
● Aggregating Change Events is relatively straightforward, but the details matter
At what scale? $640B in annual payment volume. Challenging…
Many products, many apps and services, many datasets.
Across many databases of different types: Mongo, MySQL. Multi-region; databases have many shards, which are split as volume grows.
Watermarks are per partition, not per key. Perhaps note an upstream issue; nonetheless, it could have surfaced by testing with late events.
Watermark = minimum across parallel inputs.
Keys can go to the same partition; one key could be late, another not. The watermark will still progress. The timeout will fire: incomplete aggregate. The late key then comes in and is treated as an incomplete aggregate again.
Connect with broadcast stream.
processElement -> check broadcast state
processBroadcastElement -> update state
Union or join. Streams are independent and any one stream can have duplicates. A duplicate results in an incomplete aggregate for that key, unless all streams have the same number of duplicates for that key, which is unlikely.
Imagine an aggregate was just completed for a key. Then a duplicate arrives and the event sits in state until it times out.