My talk from ScalaDays 2016 in New York on May 11, 2016:
Transitioning from a monolithic application to a set of microservices can help increase performance and scalability, but it can also drastically increase complexity. Layers of inter-service network calls for add latency and an increasing risk of failure where previously only local function calls existed. In this talk, I'll speak about how to tame this complexity using Apache Kafka and Reactive Streams to:
- Extract non-critical processing from the critical path of your application to reduce request latency
- Provide back-pressure to handle both slow and fast producers/consumers
- Maintain high availability, high performance, and reliable messaging
- Evolve message payloads while maintaining backwards and forwards compatibility.
Running Apache Kafka in production is only the first step in the Kafka operations journey. Professional Kafka users are ready to handle all possible disasters - because for most businesses having a disaster recovery plan is not optional.
In this session, we’ll discuss disaster scenarios that can take down entire Kafka clusters and share advice on how to plan, prepare and handle these events. This is a technical session full of best practices - we want to make sure you are ready to handle the worst mayhem that nature and auditors can cause.
Visit www.confluent.io for more information.
Mario Molina, Software Engineer
CDC systems are usually used to identify changes in data sources, capture and replicate those changes to other systems. Companies are using CDC to sync data across systems, cloud migration or even applying stream processing, among others.
In this presentation we’ll see CDC patterns, how to use it in Apache Kafka, and do a live demo!
https://www.meetup.com/Mexico-Kafka/events/277309497/
Presentation at Strata Data Conference 2018, New York
The controller is the brain of Apache Kafka. A big part of what the controller does is to maintain the consistency of the replicas and determine which replica can be used to serve the clients, especially during individual broker failure.
Jun Rao outlines the main data flow in the controller—in particular, when a broker fails, how the controller automatically promotes another replica as the leader to serve the clients, and when a broker is started, how the controller resumes the replication pipeline in the restarted broker.
Jun then describes recent improvements to the controller that allow it to handle certain edge cases correctly and increase its performance, which allows for more partitions in a Kafka cluster.
Apache Kafka becoming the message bus to transfer huge volumes of data from various sources into Hadoop.
It's also enabling many real-time system frameworks and use cases.
Managing and building clients around Apache Kafka can be challenging. In this talk, we will go through the best practices in deploying Apache Kafka
in production. How to Secure a Kafka Cluster, How to pick topic-partitions and upgrading to newer versions. Migrating to new Kafka Producer and Consumer API.
Also talk about the best practices involved in running a producer/consumer.
In Kafka 0.9 release, we’ve added SSL wire encryption, SASL/Kerberos for user authentication, and pluggable authorization. Now Kafka allows authentication of users, access control on who can read and write to a Kafka topic. Apache Ranger also uses pluggable authorization mechanism to centralize security for Kafka and other Hadoop ecosystem projects.
We will showcase open sourced Kafka REST API and an Admin UI that will help users in creating topics, re-assign partitions, Issuing
Kafka ACLs and monitoring Consumer offsets.
Running Apache Kafka in production is only the first step in the Kafka operations journey. Professional Kafka users are ready to handle all possible disasters - because for most businesses having a disaster recovery plan is not optional.
In this session, we’ll discuss disaster scenarios that can take down entire Kafka clusters and share advice on how to plan, prepare and handle these events. This is a technical session full of best practices - we want to make sure you are ready to handle the worst mayhem that nature and auditors can cause.
Visit www.confluent.io for more information.
Mario Molina, Software Engineer
CDC systems are usually used to identify changes in data sources, capture and replicate those changes to other systems. Companies are using CDC to sync data across systems, cloud migration or even applying stream processing, among others.
In this presentation we’ll see CDC patterns, how to use it in Apache Kafka, and do a live demo!
https://www.meetup.com/Mexico-Kafka/events/277309497/
Presentation at Strata Data Conference 2018, New York
The controller is the brain of Apache Kafka. A big part of what the controller does is to maintain the consistency of the replicas and determine which replica can be used to serve the clients, especially during individual broker failure.
Jun Rao outlines the main data flow in the controller—in particular, when a broker fails, how the controller automatically promotes another replica as the leader to serve the clients, and when a broker is started, how the controller resumes the replication pipeline in the restarted broker.
Jun then describes recent improvements to the controller that allow it to handle certain edge cases correctly and increase its performance, which allows for more partitions in a Kafka cluster.
Apache Kafka becoming the message bus to transfer huge volumes of data from various sources into Hadoop.
It's also enabling many real-time system frameworks and use cases.
Managing and building clients around Apache Kafka can be challenging. In this talk, we will go through the best practices in deploying Apache Kafka
in production. How to Secure a Kafka Cluster, How to pick topic-partitions and upgrading to newer versions. Migrating to new Kafka Producer and Consumer API.
Also talk about the best practices involved in running a producer/consumer.
In Kafka 0.9 release, we’ve added SSL wire encryption, SASL/Kerberos for user authentication, and pluggable authorization. Now Kafka allows authentication of users, access control on who can read and write to a Kafka topic. Apache Ranger also uses pluggable authorization mechanism to centralize security for Kafka and other Hadoop ecosystem projects.
We will showcase open sourced Kafka REST API and an Admin UI that will help users in creating topics, re-assign partitions, Issuing
Kafka ACLs and monitoring Consumer offsets.
Processing Semantically-Ordered Streams in Financial ServicesFlink Forward
Flink Forward San Francisco 2022.
What if my data is already in order? Stream Processing has given us an elegant and powerful solution for running analytic queries and logic over high volumes of continuously arriving data. However, in both Apache Flink and Apache Beam, the notion of time-ordering is baked in at a very low level, making it difficult to express computations that are interested in a semantic-, rather than time-ordering of the data. In financial services, what often matters the most about the data moving between systems is not when the data was created, but in what order, to the extent that many institutions engineer a global sequencing over all data entering and produced by their systems to achieve complete determinism. How, then, can financial institutions and others best employ Stream Processing on streams of data that are already ordered? I will cover various techniques that can make this work, as well as seek input from the community on how Flink might be improved to better support these use-cases.
by
Patrick Lucas
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
Flink Forward San Francisco 2022.
Flink consumers read from Kafka as a scalable, high throughput, and low latency data source. However, there are challenges in scaling out data streams where migration and multiple Kafka clusters are required. Thus, we introduced a new Kafka source to read sharded data across multiple Kafka clusters in a way that conforms well with elastic, dynamic, and reliable infrastructure. In this presentation, we will present the source design and how the solution increases application availability while reducing maintenance toil. Furthermore, we will describe how we extended the existing KafkaSource to provide mechanisms to read logical streams located on multiple clusters, to dynamically adapt to infrastructure changes, and to perform transparent cluster migrations and failover.
by
Mason Chen
Watch this talk here: https://www.confluent.io/online-talks/how-apache-kafka-works-on-demand
Pick up best practices for developing applications that use Apache Kafka, beginning with a high level code overview for a basic producer and consumer. From there we’ll cover strategies for building powerful stream processing applications, including high availability through replication, data retention policies, producer design and producer guarantees.
We’ll delve into the details of delivery guarantees, including exactly-once semantics, partition strategies and consumer group rebalances. The talk will finish with a discussion of compacted topics, troubleshooting strategies and a security overview.
This session is part 3 of 4 in our Fundamentals for Apache Kafka series.
Building Cloud-Native App Series - Part 2 of 11
Microservices Architecture Series
Event Sourcing & CQRS,
Kafka, Rabbit MQ
Case Studies (E-Commerce App, Movie Streaming, Ticket Booking, Restaurant, Hospital Management)
A Modern C++ Kafka API | Kenneth Jia, Morgan StanleyHostedbyConfluent
We wanted to embed a Kafka producer/consumer in C++ and decided to use ""librdkafka"", a robust C/C++ library that is open source, well-maintained, and widely used.
The C++ interface of ""librdkafka"" is confined to C++ 98 for compatibility, which makes it less object-oriented and user-friendly. Raw pointers in the interface complicates object lifecycle management and limited encapsulation leading to a more complex API.
That is why we built a C++ Kafka API using modern C++ as a wrapper on top of ""librdkafka"". This header-only library is quite similar to the Java API, object-oriented and user-friendly. It frees us from the burdens mentioned above so even a novice can pick it up and work out an efficient solution.
The ""modern-cpp-kafka"" project on github has been thoroughly tested within our company. After using it to replace a legacy implementation, throughput for a key middleware system gained 26%!
Apache Kafka is an open-source message broker project developed by the Apache Software Foundation written in Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
How Uber scaled its Real Time Infrastructure to Trillion events per dayDataWorks Summit
Building data pipelines is pretty hard! Building a multi-datacenter active-active real time data pipeline for multiple classes of data with different durability, latency and availability guarantees is much harder.
Real time infrastructure powers critical pieces of Uber (think Surge) and in this talk we will discuss our architecture, technical challenges, learnings and how a blend of open source infrastructure (Apache Kafka and Samza) and in-house technologies have helped Uber scale.
What's the time? ...and why? (Mattias Sax, Confluent) Kafka Summit SF 2019confluent
Data stream processing is built on the core concept of time. However, understanding time semantics and reasoning about time is not simple, especially if deterministic processing is expected. In this talk, we explain the difference between processing, ingestion, and event time and what their impact is on data stream processing. Furthermore, we explain how Kafka clusters and stream processing applications must be configured to achieve specific time semantics. Finally, we deep dive into the time semantics of the Kafka Streams DSL and KSQL operators, and explain in detail how the runtime handles time. Apache Kafka offers many ways to handle time on the storage layer, ie, the brokers, allowing users to build applications with different semantics. Time semantics in the processing layer, ie, Kafka Streams and KSQL, are even richer, more powerful, but also more complicated. Hence, it is paramount for developers, to understand different time semantics and to know how to configure Kafka to achieve them. Therefore, this talk enables developers to design applications with their desired time semantics, help them to reason about the runtime behavior with regard to time, and allow them to understand processing/query results.
Flink Forward San Francisco 2022.
Resource Elasticity is a frequently requested feature in Apache Flink: Users want to be able to easily adjust their clusters to changing workloads for resource efficiency and cost saving reasons. In Flink 1.13, the initial implementation of Reactive Mode was introduced, later releases added more improvements to make the feature production ready. In this talk, we’ll explain scenarios to deploy Reactive Mode to various environments to achieve autoscaling and resource elasticity. We’ll discuss the constraints to consider when planning to use this feature, and also potential improvements from the Flink roadmap. For those interested in the internals of Flink, we’ll also briefly explain how the feature is implemented, and if time permits, conclude with a short demo.
by
Robert Metzger
Upgrades suck. We get it. They are risky and time consuming and you have better things to do. In this talk we'll present good reasons to upgrade anyway and give suggestions on how to de-risk your upgrades. Straight from the team that upgrades Kafka almost every week. We'll review all the releases in the past year - major, minor and bug-fixes. We'll explain the differences between those and what can you expect from each. We'll go into the most important features and most critical fixes and improvements, so you'll have ample ammunition when you explain to your boss why you really have to upgrade Kafka. Then we'll discuss how we validate new releases and suggest a safe upgrade process - because we know that uneventful upgrades are a key to the next upgrade.
In the last few years, Apache Kafka has been used extensively in enterprises for real-time data collecting, delivering, and processing. In this presentation, Jun Rao, Co-founder, Confluent, gives a deep dive on some of the key internals that help make Kafka popular.
- Companies like LinkedIn are now sending more than 1 trillion messages per day to Kafka. Learn about the underlying design in Kafka that leads to such high throughput.
- Many companies (e.g., financial institutions) are now storing mission critical data in Kafka. Learn how Kafka supports high availability and durability through its built-in replication mechanism.
- One common use case of Kafka is for propagating updatable database records. Learn how a unique feature called compaction in Apache Kafka is designed to solve this kind of problem more naturally.
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Flink Forward
Flink Forward San Francisco 2022.
Being in the payments space, Stripe requires strict correctness and freshness guarantees. We rely on Flink as the natural solution for delivering on this in support of our Change Data Capture (CDC) infrastructure. We heavily rely on CDC as a tool for capturing data change streams from our databases without critically impacting database reliability, scalability, and maintainability. Data derived from these streams is used broadly across the business and powers many of our critical financial reporting systems totalling over $640 Billion in payment volume annually. We use many components of Flink’s flexible DataStream API to perform aggregations and abstract away the complexities of stream processing from our downstreams. In this talk, we’ll walk through our experience from the very beginning to what we have in production today. We’ll share stories around the technical details and trade-offs we encountered along the way.
by
Jeff Chao
Building Event Driven (Micro)services with Apache KafkaGuido Schmutz
What is a Microservices architecture and how does it differ from a Service-Oriented Architecture? Should you use traditional REST APIs to bind services together? Or is it better to use a richer, more loosely-coupled protocol? This talk will start with quick recap of how we created systems over the past 20 years and how different architectures evolved from it. The talk will show how we piece services together in event driven systems, how we use a distributed log (event hub) to create a central, persistent history of events and what benefits we achieve from doing so.
Apache Kafka is a perfect match for building such an asynchronous, loosely-coupled event-driven backbone. Events trigger processing logic, which can be implemented in a more traditional as well as in a stream processing fashion. The talk will show the difference between a request-driven and event-driven communication and show when to use which. It highlights how the modern stream processing systems can be used to hold state both internally as well as in a database and how this state can be used to further increase independence of other services, the primary goal of a Microservices architecture.
Building distributed systems is challenging. Luckily, Apache Kafka provides a powerful toolkit for putting together big services as a set of scalable, decoupled components. In this talk, I'll describe some of the design tradeoffs when building microservices, and how Kafka's powerful abstractions can help. I'll also talk a little bit about what the community has been up to with Kafka Streams, Kafka Connect, and exactly-once semantics.
Presentation by Colin McCabe, Confluent, Big Data Day LA
Getting Started with Confluent Schema Registryconfluent
Getting started with Confluent Schema Registry, Patrick Druley, Senior Solutions Engineer, Confluent
Meetup link: https://www.meetup.com/Cleveland-Kafka/events/272787313/
Apache Kafka's rise in popularity as a streaming platform has demanded a revisit of its traditional at-least-once message delivery semantics.
In this talk, we present the recent additions to Kafka to achieve exactly-once semantics (EoS) including support for idempotence and transactions in the Kafka clients. The main focus will be the specific semantics that Kafka distributed transactions enable and the underlying mechanics which allow them to scale efficiently.
From a kafkaesque story to The Promised LandRan Silberman
LivePerson moved from an ETL based data platform to a new data platform based on emerging technologies from the Open Source community: Hadoop, Kafka, Storm, Avro and more.
This presentation tells the story and focuses on Kafka.
Keystone processes over 1 trillion events per day with at-least once processing semantics in the cloud. We will explore in detail how we have modified and leverage Kafka, Samza, Docker, and Linux at scale to implement a multi-tenant pipeline in the Amazon AWS cloud within a year.
Processing Semantically-Ordered Streams in Financial ServicesFlink Forward
Flink Forward San Francisco 2022.
What if my data is already in order? Stream Processing has given us an elegant and powerful solution for running analytic queries and logic over high volumes of continuously arriving data. However, in both Apache Flink and Apache Beam, the notion of time-ordering is baked in at a very low level, making it difficult to express computations that are interested in a semantic-, rather than time-ordering of the data. In financial services, what often matters the most about the data moving between systems is not when the data was created, but in what order, to the extent that many institutions engineer a global sequencing over all data entering and produced by their systems to achieve complete determinism. How, then, can financial institutions and others best employ Stream Processing on streams of data that are already ordered? I will cover various techniques that can make this work, as well as seek input from the community on how Flink might be improved to better support these use-cases.
by
Patrick Lucas
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
Flink Forward San Francisco 2022.
Flink consumers read from Kafka as a scalable, high throughput, and low latency data source. However, there are challenges in scaling out data streams where migration and multiple Kafka clusters are required. Thus, we introduced a new Kafka source to read sharded data across multiple Kafka clusters in a way that conforms well with elastic, dynamic, and reliable infrastructure. In this presentation, we will present the source design and how the solution increases application availability while reducing maintenance toil. Furthermore, we will describe how we extended the existing KafkaSource to provide mechanisms to read logical streams located on multiple clusters, to dynamically adapt to infrastructure changes, and to perform transparent cluster migrations and failover.
by
Mason Chen
Watch this talk here: https://www.confluent.io/online-talks/how-apache-kafka-works-on-demand
Pick up best practices for developing applications that use Apache Kafka, beginning with a high level code overview for a basic producer and consumer. From there we’ll cover strategies for building powerful stream processing applications, including high availability through replication, data retention policies, producer design and producer guarantees.
We’ll delve into the details of delivery guarantees, including exactly-once semantics, partition strategies and consumer group rebalances. The talk will finish with a discussion of compacted topics, troubleshooting strategies and a security overview.
This session is part 3 of 4 in our Fundamentals for Apache Kafka series.
Building Cloud-Native App Series - Part 2 of 11
Microservices Architecture Series
Event Sourcing & CQRS,
Kafka, Rabbit MQ
Case Studies (E-Commerce App, Movie Streaming, Ticket Booking, Restaurant, Hospital Management)
A Modern C++ Kafka API | Kenneth Jia, Morgan StanleyHostedbyConfluent
We wanted to embed a Kafka producer/consumer in C++ and decided to use ""librdkafka"", a robust C/C++ library that is open source, well-maintained, and widely used.
The C++ interface of ""librdkafka"" is confined to C++ 98 for compatibility, which makes it less object-oriented and user-friendly. Raw pointers in the interface complicates object lifecycle management and limited encapsulation leading to a more complex API.
That is why we built a C++ Kafka API using modern C++ as a wrapper on top of ""librdkafka"". This header-only library is quite similar to the Java API, object-oriented and user-friendly. It frees us from the burdens mentioned above so even a novice can pick it up and work out an efficient solution.
The ""modern-cpp-kafka"" project on github has been thoroughly tested within our company. After using it to replace a legacy implementation, throughput for a key middleware system gained 26%!
Apache Kafka is an open-source message broker project developed by the Apache Software Foundation written in Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
How Uber scaled its Real Time Infrastructure to Trillion events per dayDataWorks Summit
Building data pipelines is pretty hard! Building a multi-datacenter active-active real time data pipeline for multiple classes of data with different durability, latency and availability guarantees is much harder.
Real time infrastructure powers critical pieces of Uber (think Surge) and in this talk we will discuss our architecture, technical challenges, learnings and how a blend of open source infrastructure (Apache Kafka and Samza) and in-house technologies have helped Uber scale.
What's the time? ...and why? (Mattias Sax, Confluent) Kafka Summit SF 2019confluent
Data stream processing is built on the core concept of time. However, understanding time semantics and reasoning about time is not simple, especially if deterministic processing is expected. In this talk, we explain the difference between processing, ingestion, and event time and what their impact is on data stream processing. Furthermore, we explain how Kafka clusters and stream processing applications must be configured to achieve specific time semantics. Finally, we deep dive into the time semantics of the Kafka Streams DSL and KSQL operators, and explain in detail how the runtime handles time. Apache Kafka offers many ways to handle time on the storage layer, ie, the brokers, allowing users to build applications with different semantics. Time semantics in the processing layer, ie, Kafka Streams and KSQL, are even richer, more powerful, but also more complicated. Hence, it is paramount for developers, to understand different time semantics and to know how to configure Kafka to achieve them. Therefore, this talk enables developers to design applications with their desired time semantics, help them to reason about the runtime behavior with regard to time, and allow them to understand processing/query results.
Flink Forward San Francisco 2022.
Resource Elasticity is a frequently requested feature in Apache Flink: Users want to be able to easily adjust their clusters to changing workloads for resource efficiency and cost saving reasons. In Flink 1.13, the initial implementation of Reactive Mode was introduced, later releases added more improvements to make the feature production ready. In this talk, we’ll explain scenarios to deploy Reactive Mode to various environments to achieve autoscaling and resource elasticity. We’ll discuss the constraints to consider when planning to use this feature, and also potential improvements from the Flink roadmap. For those interested in the internals of Flink, we’ll also briefly explain how the feature is implemented, and if time permits, conclude with a short demo.
by
Robert Metzger
Upgrades suck. We get it. They are risky and time consuming and you have better things to do. In this talk we'll present good reasons to upgrade anyway and give suggestions on how to de-risk your upgrades. Straight from the team that upgrades Kafka almost every week. We'll review all the releases in the past year - major, minor and bug-fixes. We'll explain the differences between those and what can you expect from each. We'll go into the most important features and most critical fixes and improvements, so you'll have ample ammunition when you explain to your boss why you really have to upgrade Kafka. Then we'll discuss how we validate new releases and suggest a safe upgrade process - because we know that uneventful upgrades are a key to the next upgrade.
In the last few years, Apache Kafka has been used extensively in enterprises for real-time data collecting, delivering, and processing. In this presentation, Jun Rao, Co-founder, Confluent, gives a deep dive on some of the key internals that help make Kafka popular.
- Companies like LinkedIn are now sending more than 1 trillion messages per day to Kafka. Learn about the underlying design in Kafka that leads to such high throughput.
- Many companies (e.g., financial institutions) are now storing mission critical data in Kafka. Learn how Kafka supports high availability and durability through its built-in replication mechanism.
- One common use case of Kafka is for propagating updatable database records. Learn how a unique feature called compaction in Apache Kafka is designed to solve this kind of problem more naturally.
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Flink Forward
Flink Forward San Francisco 2022.
Being in the payments space, Stripe requires strict correctness and freshness guarantees. We rely on Flink as the natural solution for delivering on this in support of our Change Data Capture (CDC) infrastructure. We heavily rely on CDC as a tool for capturing data change streams from our databases without critically impacting database reliability, scalability, and maintainability. Data derived from these streams is used broadly across the business and powers many of our critical financial reporting systems totalling over $640 Billion in payment volume annually. We use many components of Flink’s flexible DataStream API to perform aggregations and abstract away the complexities of stream processing from our downstreams. In this talk, we’ll walk through our experience from the very beginning to what we have in production today. We’ll share stories around the technical details and trade-offs we encountered along the way.
by
Jeff Chao
Building Event Driven (Micro)services with Apache KafkaGuido Schmutz
What is a Microservices architecture and how does it differ from a Service-Oriented Architecture? Should you use traditional REST APIs to bind services together? Or is it better to use a richer, more loosely-coupled protocol? This talk will start with quick recap of how we created systems over the past 20 years and how different architectures evolved from it. The talk will show how we piece services together in event driven systems, how we use a distributed log (event hub) to create a central, persistent history of events and what benefits we achieve from doing so.
Apache Kafka is a perfect match for building such an asynchronous, loosely-coupled event-driven backbone. Events trigger processing logic, which can be implemented in a more traditional as well as in a stream processing fashion. The talk will show the difference between a request-driven and event-driven communication and show when to use which. It highlights how the modern stream processing systems can be used to hold state both internally as well as in a database and how this state can be used to further increase independence of other services, the primary goal of a Microservices architecture.
Building distributed systems is challenging. Luckily, Apache Kafka provides a powerful toolkit for putting together big services as a set of scalable, decoupled components. In this talk, I'll describe some of the design tradeoffs when building microservices, and how Kafka's powerful abstractions can help. I'll also talk a little bit about what the community has been up to with Kafka Streams, Kafka Connect, and exactly-once semantics.
Presentation by Colin McCabe, Confluent, Big Data Day LA
Getting Started with Confluent Schema Registryconfluent
Getting started with Confluent Schema Registry, Patrick Druley, Senior Solutions Engineer, Confluent
Meetup link: https://www.meetup.com/Cleveland-Kafka/events/272787313/
Apache Kafka's rise in popularity as a streaming platform has demanded a revisit of its traditional at-least-once message delivery semantics.
In this talk, we present the recent additions to Kafka to achieve exactly-once semantics (EoS) including support for idempotence and transactions in the Kafka clients. The main focus will be the specific semantics that Kafka distributed transactions enable and the underlying mechanics which allow them to scale efficiently.
From a kafkaesque story to The Promised LandRan Silberman
LivePerson moved from an ETL based data platform to a new data platform based on emerging technologies from the Open Source community: Hadoop, Kafka, Storm, Avro and more.
This presentation tells the story and focuses on Kafka.
Keystone processes over 1 trillion events per day with at-least once processing semantics in the cloud. We will explore in detail how we have modified and leverage Kafka, Samza, Docker, and Linux at scale to implement a multi-tenant pipeline in the Amazon AWS cloud within a year.
Kafka's basic terminologies, its architecture, its protocol and how it works.
Kafka at scale, its caveats, guarantees and use cases offered by it.
How we use it @ZaprMediaLabs.
From a Kafkaesque Story to The Promised Land at LivePersonLivePerson
Ran Silberman, developer & technical leader at LivePerson presents how LivePerson moved their data platform from a legacy ETL concept to new "Data Integration" concept of our era.
Kafka is the main infrastructure that holds the backbone for data flow in the new Data Integration. Having that said, Kafka cannot come by itself. Other supporting systems like Hadoop, Storm, and Avro protocol were also integrated.
In this lecture Ran will describe the implementation in LivePerson and will share some tips and how to avoid pitfalls.
Read More: https://connect.liveperson.com/community/developers/blog/2013/11/21/from-a-kafkaesque-story-to-the-promised-land
Kafka is a high-throughput, fault-tolerant, scalable platform for building high-volume near-real-time data pipelines. This presentation is about tuning Kafka pipelines for high-performance.
Select configuration parameters and deployment topologies essential to achieve higher throughput and low latency across the pipeline are discussed. Lessons learned in troubleshooting and optimizing a truly global data pipeline that replicates 100GB data under 25 minutes is discussed.
Apache Kafka - Scalable Message-Processing and more !Guido Schmutz
Independent of the source of data, the integration of event streams into an Enterprise Architecture gets more and more important in the world of sensors, social media streams and Internet of Things. Events have to be accepted quickly and reliably, they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. How can me make sure that all these event are accepted and forwarded in an efficient and reliable way? This is where Apache Kafaka comes into play, a distirbuted, highly-scalable messaging broker, build for exchanging huge amount of messages between a source and a target.
This session will start with an introduction into Apache and presents the role of Apache Kafka in a modern data / information architecture and the advantages it brings to the table. Additionally the Kafka ecosystem will be covered as well as the integration of Kafka in the Oracle Stack, with products such as Golden Gate, Service Bus and Oracle Stream Analytics all being able to act as a Kafka consumer or producer.
Kafka is primarily used to build real-time streaming data pipelines and applications that adapt to the data streams. It combines messaging, storage, and stream processing to allow storage and analysis of both historical and real-time data.
Over 100 million subscribers from over 190 countries enjoy the Netflix service. This leads to over a trillion events, amounting to 3 PB, flowing through the Keystone infrastructure to help improve customer experience and glean business insights. The self-serve Keystone stream processing service processes these messages in near real-time with at-least once semantics in the cloud. This enables the users to focus on extracting insights, and not worry about building out scalable infrastructure. I’ll share the details about this platform, and our experience building it.
___________________________________________
Meetup#7 | Session 2 | 21/03/2018 | Taboola
_____________________________________________
In this talk, we will present our multi-DC Kafka architecture, and discuss how we tackle sending and handling 10B+ messages per day, with maximum availability and no tolerance for data loss.
Our architecture includes technologies such as Cassandra, Spark, HDFS, and Vertica - with Kafka as the backbone that feeds them all.
At Hootsuite, we've been transitioning from a single monolithic PHP application to a set of scalable Scala-based microservices. To avoid excessive coupling between services, we've implemented an event system using Apache Kafka that allows events to be reliably produced + consumed asynchronously from services as well as data stores.
In this presentation, I talk about:
- Why we chose Kafka
- How we set up our Kafka clusters to be scalable, highly available, and multi-data-center aware.
- How we produce + consume events
- How we ensure that events can be understood by all parts of our system (Some that are implemented in other programming languages like PHP and Python) and how we handle evolving event payload data.
Peng Kang, Software Engineer, Dropbox + Richi Gupta, Engineering Manager, Dropbox
As a scalable and reliable data streaming solution with a rich ecosystem, Kafka is widely adopted in Dropbox infrastructure in various scenarios. It is part of Dropbox’s analytics data pipeline, stream processing platform and more mission critical systems. Jetstream is the team that provides Kafka as a service in Dropbox infrastructure. We manage the clusters, develop tooling, and enforce policies, so that our users can enjoy a highly available and reliable service. In this talk, we will share our experiences and learnings running Kafka clusters, pipelines that enable high durability (direct writes to kafka) and availability (goscribe), the policies we enforce for high reliability, the tooling we have for maintenance and stress testing, and finally an overview of Dropbox’s next generation queueing service built on top Kafka.
https://www.meetup.com/KafkaBayArea/events/266327152/
World of Tanks Experience of Using KafkaLevon Avakyan
In this paper I speak about BigWorld technology, WoT server, Apache Kafka and how we started to use it together. What difficulties we had and how we had solved them.
Apache Kafka as Message Queue for your microservices and other occasionsMichael Reinsch
This talk provides a quick intro to Apache Kafka, the basic concepts, and why it's good as a message queue.
We'll also explore the benefits and challenges of using a message queue as base of your microservices infrastructure (especially when transitioning from a monolith).
Similar to Reducing Microservice Complexity with Kafka and Reactive Streams (20)
First Steps with Globus Compute Multi-User EndpointsGlobus
In this presentation we will share our experiences around getting started with the Globus Compute multi-user endpoint. Working with the Pharmacology group at the University of Auckland, we have previously written an application using Globus Compute that can offload computationally expensive steps in the researcher's workflows, which they wish to manage from their familiar Windows environments, onto the NeSI (New Zealand eScience Infrastructure) cluster. Some of the challenges we have encountered were that each researcher had to set up and manage their own single-user globus compute endpoint and that the workloads had varying resource requirements (CPUs, memory and wall time) between different runs. We hope that the multi-user endpoint will help to address these challenges and share an update on our progress here.
Large Language Models and the End of ProgrammingMatt Welsh
Talk by Matt Welsh at Craft Conference 2024 on the impact that Large Language Models will have on the future of software development. In this talk, I discuss the ways in which LLMs will impact the software industry, from replacing human software developers with AI, to replacing conventional software with models that perform reasoning, computation, and problem-solving.
Enhancing Research Orchestration Capabilities at ORNL.pdfGlobus
Cross-facility research orchestration comes with ever-changing constraints regarding the availability and suitability of various compute and data resources. In short, a flexible data and processing fabric is needed to enable the dynamic redirection of data and compute tasks throughout the lifecycle of an experiment. In this talk, we illustrate how we easily leveraged Globus services to instrument the ACE research testbed at the Oak Ridge Leadership Computing Facility with flexible data and task orchestration capabilities.
A Comprehensive Look at Generative AI in Retail App Testing.pdfkalichargn70th171
Traditional software testing methods are being challenged in retail, where customer expectations and technological advancements continually shape the landscape. Enter generative AI—a transformative subset of artificial intelligence technologies poised to revolutionize software testing.
Code reviews are vital for ensuring good code quality. They serve as one of our last lines of defense against bugs and subpar code reaching production.
Yet, they often turn into annoying tasks riddled with frustration, hostility, unclear feedback and lack of standards. How can we improve this crucial process?
In this session we will cover:
- The Art of Effective Code Reviews
- Streamlining the Review Process
- Elevating Reviews with Automated Tools
By the end of this presentation, you'll have the knowledge on how to organize and improve your code review proces
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Globus
The U.S. Geological Survey (USGS) has made substantial investments in meeting evolving scientific, technical, and policy driven demands on storing, managing, and delivering data. As these demands continue to grow in complexity and scale, the USGS must continue to explore innovative solutions to improve its management, curation, sharing, delivering, and preservation approaches for large-scale research data. Supporting these needs, the USGS has partnered with the University of Chicago-Globus to research and develop advanced repository components and workflows leveraging its current investment in Globus. The primary outcome of this partnership includes the development of a prototype enterprise repository, driven by USGS Data Release requirements, through exploration and implementation of the entire suite of the Globus platform offerings, including Globus Flow, Globus Auth, Globus Transfer, and Globus Search. This presentation will provide insights into this research partnership, introduce the unique requirements and challenges being addressed and provide relevant project progress.
Quarkus Hidden and Forbidden ExtensionsMax Andersen
Quarkus has a vast extension ecosystem and is known for its subsonic and subatomic feature set. Some of these features are not as well known, and some extensions are less talked about, but that does not make them less interesting - quite the opposite.
Come join this talk to see some tips and tricks for using Quarkus and some of the lesser known features, extensions and development techniques.
Navigating the Metaverse: A Journey into Virtual Evolution"Donna Lenk
Join us for an exploration of the Metaverse's evolution, where innovation meets imagination. Discover new dimensions of virtual events, engage with thought-provoking discussions, and witness the transformative power of digital realms."
Check out the webinar slides to learn more about how XfilesPro transforms Salesforce document management by leveraging its world-class applications. For more details, please connect with sales@xfilespro.com
If you want to watch the on-demand webinar, please click here: https://www.xfilespro.com/webinars/salesforce-document-management-2-0-smarter-faster-better/
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Globus
The Earth System Grid Federation (ESGF) is a global network of data servers that archives and distributes the planet’s largest collection of Earth system model output for thousands of climate and environmental scientists worldwide. Many of these petabyte-scale data archives are located in proximity to large high-performance computing (HPC) or cloud computing resources, but the primary workflow for data users consists of transferring data, and applying computations on a different system. As a part of the ESGF 2.0 US project (funded by the United States Department of Energy Office of Science), we developed pre-defined data workflows, which can be run on-demand, capable of applying many data reduction and data analysis to the large ESGF data archives, transferring only the resultant analysis (ex. visualizations, smaller data files). In this talk, we will showcase a few of these workflows, highlighting how Globus Flows can be used for petabyte-scale climate analysis.
In software engineering, the right architecture is essential for robust, scalable platforms. Wix has undergone a pivotal shift from event sourcing to a CRUD-based model for its microservices. This talk will chart the course of this pivotal journey.
Event sourcing, which records state changes as immutable events, provided robust auditing and "time travel" debugging for Wix Stores' microservices. Despite its benefits, the complexity it introduced in state management slowed development. Wix responded by adopting a simpler, unified CRUD model. This talk will explore the challenges of event sourcing and the advantages of Wix's new "CRUD on steroids" approach, which streamlines API integration and domain event management while preserving data integrity and system resilience.
Participants will gain valuable insights into Wix's strategies for ensuring atomicity in database updates and event production, as well as caching, materialization, and performance optimization techniques within a distributed system.
Join us to discover how Wix has mastered the art of balancing simplicity and extensibility, and learn how the re-adoption of the modest CRUD has turbocharged their development velocity, resilience, and scalability in a high-growth environment.
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work the team is investigating ways to speedup the time to solution for many different parts of the DIII-D workflow including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteGoogle
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
👉👉 Click Here To Get More Info 👇👇
https://sumonreview.com/ai-pilot-review/
AI Pilot Review: Key Features
✅Deploy AI expert bots in Any Niche With Just A Click
✅With one keyword, generate complete funnels, websites, landing pages, and more.
✅More than 85 AI features are included in the AI pilot.
✅No setup or configuration; use your voice (like Siri) to do whatever you want.
✅You Can Use AI Pilot To Create your version of AI Pilot And Charge People For It…
✅ZERO Manual Work With AI Pilot. Never write, Design, Or Code Again.
✅ZERO Limits On Features Or Usages
✅Use Our AI-powered Traffic To Get Hundreds Of Customers
✅No Complicated Setup: Get Up And Running In 2 Minutes
✅99.99% Up-Time Guaranteed
✅30 Days Money-Back Guarantee
✅ZERO Upfront Cost
See My Other Reviews Article:
(1) TubeTrivia AI Review: https://sumonreview.com/tubetrivia-ai-review
(2) SocioWave Review: https://sumonreview.com/sociowave-review
(3) AI Partner & Profit Review: https://sumonreview.com/ai-partner-profit-review
(4) AI Ebook Suite Review: https://sumonreview.com/ai-ebook-suite-review
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisGlobus
JASMIN is the UK’s high-performance data analysis platform for environmental science, operated by STFC on behalf of the UK Natural Environment Research Council (NERC). In addition to its role in hosting the CEDA Archive (NERC’s long-term repository for climate, atmospheric science & Earth observation data in the UK), JASMIN provides a collaborative platform to a community of around 2,000 scientists in the UK and beyond, providing nearly 400 environmental science projects with working space, compute resources and tools to facilitate their work. High-performance data transfer into and out of JASMIN has always been a key feature, with many scientists bringing model outputs from supercomputers elsewhere in the UK, to analyse against observational or other model data in the CEDA Archive. A growing number of JASMIN users are now realising the benefits of using the Globus service to provide reliable and efficient data movement and other tasks in this and other contexts. Further use cases involve long-distance (intercontinental) transfers to and from JASMIN, and collecting results from a mobile atmospheric radar system, pushing data to JASMIN via a lightweight Globus deployment. We provide details of how Globus fits into our current infrastructure, our experience of the recent migration to GCSv5.4, and of our interest in developing use of the wider ecosystem of Globus services for the benefit of our user community.
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Shahin Sheidaei
Games are powerful teaching tools, fostering hands-on engagement and fun. But they require careful consideration to succeed. Join me to explore factors in running and selecting games, ensuring they serve as effective teaching tools. Learn to maintain focus on learning objectives while playing, and how to measure the ROI of gaming in education. Discover strategies for pitching gaming to leadership. This session offers insights, tips, and examples for coaches, team leads, and enterprise leaders seeking to teach from simple to complex concepts.
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Anthony Dahanne
Les Buildpacks existent depuis plus de 10 ans ! D’abord, ils étaient utilisés pour détecter et construire une application avant de la déployer sur certains PaaS. Ensuite, nous avons pu créer des images Docker (OCI) avec leur dernière génération, les Cloud Native Buildpacks (CNCF en incubation). Sont-ils une bonne alternative au Dockerfile ? Que sont les buildpacks Paketo ? Quelles communautés les soutiennent et comment ?
Venez le découvrir lors de cette session ignite
25. • Distributed, partitioned, replicated
commit log service
• Pub/Sub messaging functionality
• Created by LinkedIn, now an
Apache open-source project
What is Kafka?
28. • Send messages to topics
• Responsible for choosing which
partition to send to
• Round-robin
• Consistent hashing based on a
message key
Producers
29. • Pull messages from topics
• Track their own offset in each
partition
Consumers
32. • Hundreds of MB/s of reads/writes from
thousands of concurrent clients
• LinkedIn (2015)
• 800 billion messages per day (18 million/s
peak)
• 175 TB of data produced per day
• > 1000 servers in 60 clusters
Kafka is Fast
33. • Brokers
• All data is persisted to disk
• Partitions replicated to other nodes
• Consumers
• Start where they left off
• Producers
• Can retry - at-least-once messaging
Kafka is Resilient
34. • Capacity can be added at runtime
with zero downtime
• More servers => more disk space
• Topics can be larger than any single
node could hold
• Additional partitions can be added to
add more parallelism
Kafka is Scalable
35. • Large storage capacity
• Topic retention is a Consumer SLA
• Almost impossible for a fast
producer to overload a slow
consumer
• Allows real-time as well as batch
consumption
Kafka Helps with Back-Pressure
39. • Standard for async stream
processing with non-blocking back-
pressure
• Subscriber signals demand to publisher
• Publisher sends no more than demand
• Low-level
• Mainly meant for library authors
Reactive Streams
49. • Sink - sends message to Kafka topic
• Flow - sends message to Kafka topic +
emits result downstream
• When the stream completes/fails the
connection to Kafka will be
automatically closed
Producer
50. • Source - pulls messages from
Kafka topics
• Offset Management
• Back-pressure
• Materialization
• Object that can stop the consumer
(and complete the stream)
Consumer
51. Simple Producer Example
implicit val system = ActorSystem("producer-test")
implicit val materializer = ActorMaterializer()
val producerSettings = ProducerSettings(
system, new ByteArraySerializer, new StringSerializer
).withBootstrapServers("localhost:9092")
Source(1 to 100)
.map(i => s"Message $i")
.map(m => new ProducerRecord[Array[Byte], String]("lower", m))
.to(Producer.plainSink(producerSettings)).run()
52. Simple Consumer Example
implicit val system = ActorSystem("producer-test")
implicit val materializer = ActorMaterializer()
val consumerSettings = ConsumerSettings(
system, new ByteArrayDeserializer, new StringDeserializer, Set("lower")
).withBootstrapServers("localhost:9092").withGroupId("test-group")
val control = Consumer.atMostOnceSource(consumerSettings.withClientId("client1"))
.map(record => record.value)
.to(Sink.foreach(v => println(v))).run()
control.stop()
53. val control = Consumer.committableSource(consumerSettings.withClientId("client1"))
.map { msg =>
val upper = msg.value.toUpperCase
Producer.Message(
new ProducerRecord[Array[Byte], String]("upper", upper),
msg.committableOffset)
}.to(Producer.commitableSink(producerSettings)).run()
control.stop()
Combined Example
56. • Microservices have many advantages, but can
introduce failure and complexity.
• Asynchronous messaging can help reduce this
complexity and Kafka is a great option.
• Akka Streams makes reliably processing data
from Kafka with back-pressure easy
Wrap-Up