Introducing Apache Kafka - a visual overview. Presented at the Canberra Big Data Meetup 7 February 2019. We build a Kafka "postal service" to explain the main Kafka concepts, and explain how consumers receive different messages depending on whether there's a key or not.
Watch this talk here: https://www.confluent.io/online-talks/apache-kafka-architecture-and-fundamentals-explained-on-demand
This session explains Apache Kafka’s internal design and architecture. Companies like LinkedIn are now sending more than 1 trillion messages per day to Apache Kafka. Learn about the underlying design in Kafka that leads to such high throughput.
This talk provides a comprehensive overview of Kafka architecture and internal functions, including:
- Topics, partitions and segments
- The commit log and streams
- Brokers and broker replication
- Producer basics
- Consumers, consumer groups and offsets
This session is part 2 of 4 in our Fundamentals for Apache Kafka series.
Kafka's basic terminologies, its architecture, its protocol and how it works.
Kafka at scale, its caveats, guarantees and use cases offered by it.
How we use it @ZaprMediaLabs.
Apache Kafka is an open-source message broker project developed by the Apache Software Foundation and written in Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
A brief introduction to Apache Kafka, describing its usage as a platform for streaming data. It will introduce some of the newer components of Kafka that help make this possible, including Kafka Connect, a framework for capturing continuous data streams, and Kafka Streams, a lightweight stream processing library.
Kafka Tutorial - Introduction to Apache Kafka (Part 1) - Jean-Paul Azar
Why is Kafka so fast? Why is Kafka so popular? Why Kafka? This slide deck is a tutorial for the Kafka streaming platform. It covers Kafka architecture with some small examples from the command line, then expands on this with a multi-server example to demonstrate failover of brokers as well as consumers. It then goes through some simple Java client examples for a Kafka Producer and a Kafka Consumer. We have also expanded the Kafka design section and added references. The tutorial covers Avro and the Schema Registry as well as advanced Kafka Producers.
Kafka Streams is a new stream processing library natively integrated with Kafka. It has a very low barrier to entry, easy operationalization, and a natural DSL for writing stream processing applications. As such it is the most convenient yet scalable option to analyze, transform, or otherwise process data that is backed by Kafka. We will provide the audience with an overview of Kafka Streams including its design and API, typical use cases, code examples, and an outlook of its upcoming roadmap. We will also compare Kafka Streams' light-weight library approach with heavier, framework-based tools such as Spark Streaming or Storm, which require you to understand and operate a whole different infrastructure for processing real-time data in Kafka.
The session discusses how companies are using Apache Kafka and also covers under-the-hood details like partitions, brokers, and replication.
About Apache Kafka: Apache Kafka is a distributed streaming platform. It provides low-latency, high-throughput, fault-tolerant publish and subscribe pipelines and is able to process streams of events. Kafka provides reliable, millisecond responses to support both customer-facing applications and connecting downstream systems with real-time data.
Kafka is the most popular messaging queue.
Key Areas:
What is a Messaging Queue?
Why use a Messaging Queue?
Kafka - basic terminologies
Kafka - architecture (message flow)
AWS SQS vs Apache Kafka
Netflix recently changed its data pipeline architecture to use Kafka as the gateway for data collection for all applications, processing hundreds of billions of messages daily. This session will discuss the motivation for moving to Kafka, the architecture, and the improvements we have added to make Kafka work in AWS. We will also share the lessons learned and future plans.
Common issues with Apache Kafka® Producer - Confluent
Badai Aqrandista, Confluent, Senior Technical Support Engineer
This session will be about a common issue in the Kafka Producer: producer batch expiry. We will be discussing the Kafka Producer internals, its common causes, such as a slow network or small batching, and how to overcome them. We will also be sharing some examples along the way!
https://www.meetup.com/apache-kafka-sydney/events/279651982/
Apache Kafka - Scalable Message-Processing and more! - Guido Schmutz
Independent of the source of data, the integration of event streams into an enterprise architecture gets more and more important in the world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. How can we make sure that all these events are accepted and forwarded in an efficient and reliable way? This is where Apache Kafka comes into play: a distributed, highly scalable messaging broker, built for exchanging huge amounts of messages between a source and a target.
This session will start with an introduction to Apache Kafka and present its role in a modern data / information architecture and the advantages it brings to the table. Additionally, the Kafka ecosystem will be covered, as well as the integration of Kafka in the Oracle stack, with products such as Golden Gate, Service Bus and Oracle Stream Analytics all being able to act as a Kafka consumer or producer.
What makes Kafka so special among the myriad of event streaming solutions? How easy is it to build a scalable event sourced architectures with it? How do streaming microservices fit into Kafka? What was the Metamorphosis that Apache (not Franz!) Kafka brought to event processing?
In this talk we will introduce Kafka, reveal its inner workings as well as some exciting tools built around it, compare it with the competition, and discuss some of its most traditional use cases.
From a Kafkaesque Story to The Promised Land at LivePerson - LivePerson
Ran Silberman, developer & technical leader at LivePerson presents how LivePerson moved their data platform from a legacy ETL concept to new "Data Integration" concept of our era.
Kafka is the main infrastructure that forms the backbone of data flow in the new Data Integration. That said, Kafka cannot come by itself: other supporting systems like Hadoop, Storm, and the Avro protocol were also integrated.
In this lecture Ran will describe the implementation at LivePerson and will share some tips on how to avoid pitfalls.
Read More: https://connect.liveperson.com/community/developers/blog/2013/11/21/from-a-kafkaesque-story-to-the-promised-land
Download and view it as a slide show to see the animations.
This is the PowerPoint slide deck with animations to help understand the different components of Kafka. Discover the core concepts, architecture and practical use-cases of Kafka.
From a kafkaesque story to The Promised Land - Ran Silberman
LivePerson moved from an ETL based data platform to a new data platform based on emerging technologies from the Open Source community: Hadoop, Kafka, Storm, Avro and more.
This presentation tells the story and focuses on Kafka.
Exactly Once Delivery is a Harsh Mistress - Natan Silnitsky - DevOpsDays Tel Aviv
Is Exactly Once Delivery a pipe dream? Recent versions of Kafka have claimed they have made it a reality. In this talk I will go over the different message delivery guarantees and the protocols that implement them. I will focus on Kafka's transaction-based messaging protocol, its trade-offs and shortcomings, and how it can power Wix's event-driven infrastructure for its highly distributed environment.
Is Exactly Once Delivery a pipe dream? Recent versions of Kafka have claimed they have made it a reality.
In this talk I will go over the basic theory of messaging in distributed systems, the different message delivery guarantees and the protocols that implement them.
I will focus on exactly once delivery guarantees and the way Kafka implements them with a transaction-based messaging protocol between the producer and the consumer, including a discussion of the latency/throughput trade-offs, resource utilisation, and its overall shortcomings.
Finally, I will show how it has helped power Wix's event-driven data-streaming and data-storage Infrastructure we hope to open source soon.
This infrastructure greatly simplifies building and maintaining services in a highly distributed production environment.
It lets developers focus on business logic instead of keep making sure their code is idempotent and fault-tolerant.
Exactly once delivery is a harsh mistress - DevOps Days TLV - Natan Silnitsky
In this talk I go over the basic theory of messaging in distributed systems, the different message delivery guarantees in Kafka, and how to use them.
I focus on exactly once delivery guarantees and the way Kafka implements them with a transaction-based messaging protocol.
This includes a discussion of the latency/throughput trade-offs, resource utilisation, and its overall advantages and shortcomings.
Finally, I show a use-case at Wix where exactly once delivery helped us solve a big problem.
Exactly Once Delivery with Kafka - JOTB2020 Mini Session - Natan Silnitsky
In this talk I go over the basic theory of messaging in distributed systems, the different message delivery guarantees in Kafka, and how to use them.
I focus on exactly once delivery guarantees and the way Kafka implements them with a transaction-based messaging protocol.
This includes a discussion of the latency/throughput trade-offs, resource utilisation, and its overall advantages and shortcomings.
Finally, I show a use-case at Wix where exactly once delivery helped us solve a big problem.
The Impact of Hardware and Software Version Changes on Apache Kafka Performan... - Paul Brebner
Apache Kafka's performance and scalability can be impacted by both hardware and software dimensions. In this presentation, we explore two recent experiences from running a managed Kafka service.
The first example recounts our experiences with running Kafka on AWS's Graviton2 (ARM) instances. We performed extensive benchmarking but didn't initially see the expected performance benefits. We developed multiple hypotheses to explain the unrealized performance improvement, but we could not experimentally determine the cause. We then profiled the Kafka application, and after identifying and confirming a likely cause, we found a workaround and obtained the hoped-for improved price/performance.
The second example explores the ability of Kafka to scale with increasing partitions. We revisit our previous benchmarking experiments with the newest version of Kafka (3.X), which has the option to replace Zookeeper with the new KRaft protocol. We test the theory that Kafka with KRaft can 'scale to millions of partitions' and also provide valuable experimental feedback on how close KRaft is to being production-ready.
Presentation for the ApacheCon NA Performance Engineering Track, October 6, 2022, Sheraton Hotel, New Orleans.
Spinning your Drones with Cadence Workflows and Apache Kafka - Paul Brebner
The rapid rise in Big Data use cases over the last decade has been accelerated by popular massively scalable open-source technologies such as Apache Cassandra® for storage, Apache Kafka® for streaming, and OpenSearch® for search. Now there’s a new member of the peloton, Cadence, for orchestration - code-based scalable fault-tolerant workflow orchestration. To illustrate the most important Cadence concepts (and more) we’ll build a realistic drone delivery service demonstration application. We’ll also explore what happens when orchestration meets choreography, and use the drone application to illustrate different ways to integrate Cadence with Apache Kafka, including reusing Kafka microservices. But how scalable is Cadence in practice? We’ll fill the sky with drones - how many drones can we get flying at once?
Change Data Capture (CDC) With Kafka Connect® and the Debezium PostgreSQL Sou... - Paul Brebner
Modern event-based/streaming distributed systems embrace the idea that change is inevitable and actually desirable! Without being change-aware, systems are inflexible, can’t evolve or react, and are simply incapable of keeping up with real-time real-world data. But how can we speed up an “Elephant” (PostgreSQL) to be as fast as a “Cheetah” (Kafka)? In this talk, we'll introduce the Debezium PostgreSQL Connector, and explain how to deploy, configure and run it on a Kafka Connect cluster, explore the semantics and format of the change data events (including Schemas and Table/Topic mapping), and test the performance. Finally, we'll show how to stream the change data events into an example downstream system, Elasticsearch, using an open source sink connector.
Presentation for PostgresConf.CN and PGConf.Asia 2021 https://www.highgo.ca/2022/01/19/2021-pg-asia-conference-delivered-another-successful-online-conference-again/
Scaling Open Source Big Data Cloud Applications is Easy/Hard - Paul Brebner
In the last decade, the development of modern horizontally scalable open-source Big Data technologies such as Apache Cassandra (for data storage), and Apache Kafka (for data streaming) enabled cost-effective, highly scalable, reliable, low-latency applications, and made these technologies increasingly ubiquitous. To enable reliable horizontal scalability, both Cassandra and Kafka utilize partitioning (for concurrency) and replication (for reliability and availability) across clustered servers. But building scalable applications isn’t as easy as just throwing more servers at the clusters, and unexpected speed humps are common. Consequently, you also need to understand the performance impact of partitions, replication, and clusters; monitor the correct metrics to have an end-to-end view of applications and clusters; conduct careful benchmarking, and scale and tune iteratively to take into account performance insights and optimizations. In this presentation, I will explore some of the performance goals, challenges, solutions, and results I discovered over the last 5 years building multiple realistic demonstration applications. The examples will include trade-offs with elastic Cassandra auto-scaling, scaling a Cassandra and Kafka anomaly detection application to 19 Billion checks per day, and building low-latency streaming data pipelines using Kafka Connect for multiple heterogeneous source and sink systems.
Invited keynote for 5th Workshop on Hot Topics in Cloud Computing Performance (HotCloudPerf 2022) https://hotcloudperf.spec.org/ at ICPE 2022 https://icpe2022.spec.org/
OPEN Talk: Scaling Open Source Big Data Cloud Applications is Easy/Hard - Paul Brebner
DeveloperWeek Management 2022 Conference Presentation https://www.developerweek.com/global/conference/management/schedule/
In the last decade, the development of modern horizontally scalable open-source Big Data technologies such as Apache Cassandra (for data storage), and Apache Kafka (for data streaming) enabled cost-effective, highly scalable, reliable, low-latency applications, and made these technologies increasingly ubiquitous. To enable reliable horizontal scalability, both Cassandra and Kafka utilize partitioning (for concurrency) and replication (for reliability and availability) across clustered servers. But building scalable applications isn’t as easy as just throwing more servers at the clusters, and unexpected speed humps are common. Consequently, you also need to understand the performance impact of partitions, replication, and clusters; monitor the correct metrics to have an end-to-end view of applications and clusters; conduct careful benchmarking, and scale and tune iteratively to take into account performance insights and optimizations. In this presentation, I will explore some of the performance goals, challenges, solutions, and results I discovered over the last 5 years building multiple realistic demonstration applications. The examples will include trade-offs with elastic Cassandra auto-scaling, scaling a Cassandra and Kafka anomaly detection application to 19 Billion checks per day, and building low-latency streaming data pipelines using Kafka Connect for multiple heterogeneous source and sink systems.
In this Cartoon Style Visual Introduction to Apache Kafka we’re going to build a “Postal Service” to deliver party invitations to two groups, Nerds and Pugsters – find out who goes to the party. Along the way we’ll learn about Kafka Producers, Consumers, Groups, Topics, Partitions, Keys, Records, and Delivery Semantics (guaranteed delivery, and who gets what messages). We’ll also have a quick look at Streams (mail sorting) and Connectors (how mail gets delivered between post offices).
Presentation for Open Source 101 2022: https://opensource101.com/sessions/a-visual-introduction-to-apache-kafka/
Video: https://youtu.be/NUnsHFn52sE
This presentation will explore how we added location data to a scalable real-time anomaly detection application, built around Apache Kafka, and Cassandra. Kafka and Cassandra are designed for time-series data, however, it’s not so obvious how they can efficiently process spatiotemporal data (space and time). In order to find location-specific anomalies, we need ways to represent locations, to index locations, and to query locations. We explore alternative geospatial representations including: Latitude/Longitude points, Bounding Boxes, Geohashes, and go vertical with 3D representations, including 3D Geohashes. For each representation we also explore possible Cassandra implementations including: Clustering columns, Secondary indexes, Denormalized tables, and the Cassandra Lucene Index Plugin. To conclude we measure and compare the query throughput of some of the solutions, and summarise the results in terms of accuracy vs. performance to answer the question “Which geospatial data representation and Cassandra implementation is best?”
ApacheCon NA 2020 Geospatial track presentation https://www.apachecon.com/acah2020/tracks/geospatial.html
Building a real-time data processing pipeline using Apache Kafka, Kafka Conne... - Paul Brebner
With the rapid onset of the global Covid-19 Pandemic from the start of this year the USA Centers for Disease Control and Prevention (CDC) had to quickly implement a new Covid-19 specific pipeline to collect testing data from all of the USA’s states and territories, and carry out other critical steps including integration, cleaning, checking, enrichment, analysis, and enforcing data governance and privacy etc. The pipeline then produces multiple consumable results for federal and public agencies. They did this in under 30 days, using Apache Kafka. In this presentation we'll build a similar (but simpler) pipeline for ingesting, integrating, indexing, searching/analysing and visualising some publicly available tidal data. We'll briefly introduce each technology and component, and walk through the steps of using Apache Kafka, Kafka Connect, Elasticsearch and Kibana to build the pipeline and visualise the results.
Grid Middleware – Principles, Practice and Potential - Paul Brebner
A presentation I gave at UCL in 2004 while managing the UK OGSA Evaluation Project, on leave from CSIRO, at the UCL Computer Science department, working with Wolfgang Emmerich.
Paul Brebner, University College London, Computer Science Department Seminar: "Grid Middleware - Principles, Practice, and Potential", 1 November 2004.
The project page was still here (2020): http://sse.cs.ucl.ac.uk/UK-OGSA/
Grid middleware is easy to install, configure, secure, debug and manage acros... - Paul Brebner
A presentation made in 2004 while I was managing the UK OGSA Evaluation Project, on leave from CSIRO, at the UCL Computer Science department, working with Wolfgang Emmerich: in which we "believe 6 impossible things before breakfast". This project encountered and partially solved many of the problems that cloud computing finally solved.
Paul Brebner, Oxford University Computing Laboratory invited talk: "Grid middleware is easy to install, configure, debug and manage - across multiple sites (One can't believe impossible things)", 15 October 2004.
The project web site is still here (2020): http://sse.cs.ucl.ac.uk/UK-OGSA/
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and... - Paul Brebner
This presentation will explore how we added location data to a scalable real-time anomaly detection application, built around Apache Kafka, and Cassandra. Kafka and Cassandra are designed for time-series data, however, it’s not so obvious how they can process geospatial data. In order to find location-specific anomalies, we need a way to represent locations, index locations, and query locations. We explore alternative geospatial representations including: Latitude/Longitude points, Bounding Boxes, Geohashes, and go vertical with 3D representations, including 3D Geohashes. To conclude we measure and compare the query throughput of some of the solutions, and summarise the results in terms of accuracy vs. performance to answer the question “Which geospatial data representation and Cassandra implementation is best?”
This is a slightly shorter version of previous ones.
Google Cloud Special Edition, Sydney Data Engineering Meetup
https://www.meetup.com/Sydney-Data-Engineering-Meetup/events/269146076/
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and... - Paul Brebner
This presentation will explore how we added location data to a scalable real-time anomaly detection application, built around Apache Kafka, and Cassandra.
Kafka and Cassandra are designed for time-series data, however, it’s not so obvious how they can process geospatial data. In order to find location-specific anomalies, we need a way to represent locations, index locations, and query locations.
We explore alternative geospatial representations including: Latitude/Longitude points, Bounding Boxes, Geohashes, and go vertical with 3D representations, including 3D Geohashes.
For each representation we also explore possible Cassandra implementations including: Clustering columns, Secondary indexes, Denormalized tables, and the Cassandra Lucene Index Plugin.
To conclude we measure and compare the query throughput of some of the solutions, and summarise the results in terms of accuracy vs. performance to answer the question “Which geospatial data representation and Cassandra implementation is best?”
Updated version of presentation for 30 April 2020 Melbourne Distributed Meetup (online)
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica... - Paul Brebner
Apache Kafka, Apache Cassandra and Kubernetes are open source big data technologies enabling applications and business operations to scale massively and rapidly. While Kafka and Cassandra underpin the data layer of the stack, providing the capability to stream, disseminate, store and retrieve data at very low latency, Kubernetes is a container orchestration technology that helps with automated application deployment and scaling of application clusters.
In this presentation, Paul will reveal how he architected a massive-scale deployment of a streaming data pipeline with Kafka and Cassandra to cater to an example anomaly detection application running on a Kubernetes cluster and generating and processing a massive amount of events. Anomaly detection is a method used to detect unusual events in an event stream.
It is widely used in a range of applications such as financial fraud detection, security, threat detection, website user analytics, sensors, IoT, and system health monitoring. When such applications operate at massive scale, generating millions or billions of events, they impose significant computational, performance and scalability challenges on anomaly detection algorithms and data layer technologies. Paul will demonstrate the scalability, performance and cost effectiveness of Apache Kafka, Cassandra and Kubernetes, with results from his experiments allowing the anomaly detection application to scale to 19 billion anomaly checks per day.
Melbourne Big Data Meetup, March 5 2020
https://www.eventbrite.com/e/melbourne-big-data-meetup-realtime-anomaly-detection-with-cassandra-kafka-tickets-93028445585
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and... - Paul Brebner
Geospatial data makes it possible to leverage location, location, location! Geospatial data is taking off, as companies realize that just about everyone needs the benefits of geospatially aware applications. As a result there are no shortages of unique but demanding use cases of how enterprises are leveraging large-scale and fast geospatial big data processing. The data must be processed in large quantities - and quickly - to reveal hidden spatiotemporal insights vital to businesses and their end users. In the rush to tap into geospatial data, many enterprises will find that representing, indexing and querying geospatially-enriched data is more complex than they anticipated - and might bring about tradeoffs between accuracy, latency, and throughput. This presentation will explore how we added location data to a scalable real-time anomaly detection application, built around Apache Kafka, and Cassandra. Kafka and Cassandra are designed for time-series data, however, it’s not so obvious how they can process geospatial data. In order to find location-specific anomalies, we need a way to represent locations, index locations, and query locations. We explore alternative geospatial representations including: Latitude/Longitude points, Bounding Boxes, Geohashes, and go vertical with 3D representations, including 3D Geohashes. To conclude we measure and compare the query throughput of some of the solutions, and summarise the results in terms of accuracy vs. performance to answer the question “Which geospatial data representation and Cassandra implementation is best?”
0b101000 years of computing: a personal timeline - decade "0", the 1980's - Paul Brebner
With the arrival of the 2020's I realised I've now been involved in computing for 4 decades. So I probably know more about the past of computing than I will about the future! Here's a personal timeline of hopefully interesting things from the 1980's in computing (at Waikato University, NZ, and UNSW in Australia).
ApacheCon Berlin 2019: Kongo: Building a Scalable Streaming IoT Application us... - Paul Brebner
Join me in a journey of exploration upriver with "Kongo", a scalable streaming IoT logistics demonstration application using Apache Kafka, the popular open source distributed streaming platform. Along the way you'll discover: an example logistics IoT problem domain (involving the rapid movement of thousands of goods by trucks between warehouses, with real-time checking of complex business and safety rules from sensor data); an overview of the Apache Kafka architecture and components; lessons learned from making critical Kafka application design decisions; an example of Kafka Streams for checking truck load limits; and finish the journey by overcoming final performance challenges and shooting the rapids to scale Kongo on a production Kafka cluster.
https://aceu19.apachecon.com/session/kongo-building-scalable-streaming-iot-application-using-apache-kafka
Apache Kafka, Apache Cassandra and Kubernetes are open source big data technologies enabling applications and business operations to scale massively and rapidly. While Kafka and Cassandra underpin the data layer of the stack, providing the capability to stream, disseminate, store and retrieve data at very low latency, Kubernetes is a container orchestration technology that helps with automated application deployment and scaling of application clusters. In this presentation, we will reveal how we architected a massive-scale deployment of a streaming data pipeline with Kafka and Cassandra to cater to an example anomaly detection application running on a Kubernetes cluster and generating and processing a massive amount of events. Anomaly detection is a method used to detect unusual events in an event stream. It is widely used in a range of applications such as financial fraud detection, security, threat detection, website user analytics, sensors, IoT, and system health monitoring. When such applications operate at massive scale, generating millions or billions of events, they impose significant computational, performance and scalability challenges on anomaly detection algorithms and data layer technologies. We will demonstrate the scalability, performance and cost effectiveness of Apache Kafka, Cassandra and Kubernetes, with results from our experiments allowing the anomaly detection application to scale to 19 billion anomaly checks per day.
ApacheCon2019 Talk: Improving the Observability of Cassandra, Kafka and Kuber... - Paul Brebner
As distributed applications grow more complex, dynamic, and massively scalable, “observability” becomes more critical. Observability is the practice of using metrics, monitoring and distributed tracing to understand how a system works. In this presentation we’ll explore two complementary Open Source technologies: Prometheus for monitoring application metrics; and OpenTracing and Jaeger for distributed tracing. We’ll discover how they improve the observability of a massively scalable Anomaly Detection system - an application which is built around Apache Cassandra and Apache Kafka for the data layers, and dynamically deployed and scaled on Kubernetes, a container orchestration technology. We will give an overview of Prometheus and OpenTracing/Jaeger, explain how the application is instrumented, and describe how Prometheus and OpenTracing are deployed and configured in a production environment running Kubernetes, to dynamically monitor the application at scale. We conclude by exploring the benefits of monitoring and tracing technologies for understanding, debugging and tuning complex dynamic distributed systems built on Kafka, Cassandra and Kubernetes, and introduce a new use case to enable Cassandra Elastic Autoscaling, by combining Prometheus alerts, Instaclustr’s Provisioning API for Dynamic Resizing, and the new Prometheus monitoring API.
How to Improve the Observability of Apache Cassandra and Kafka applications... - Paul Brebner
As distributed cloud applications grow more complex, dynamic, and massively scalable, “observability” becomes more critical.
Observability is the practice of using metrics, monitoring and distributed tracing to understand how a system works.
We’ll explore two complementary open source technologies: Prometheus for monitoring application metrics, and OpenTracing and Jaeger for distributed tracing. We’ll discover how they improve the observability of an Anomaly Detection application, deployed on AWS Kubernetes and using Instaclustr managed Apache Cassandra and Kafka clusters.
2. What is Kafka?
Kafka is a distributed streams processing system: messages are sent by distributed Producers to distributed Consumers via a distributed Kafka Cluster.
3. Kafka Benefits
• Fast
• Scalable
• Reliable
• Durable
• Open Source
• Managed Service
4. Kafka Benefits
• Fast – high throughput and low latency
• Scalable – horizontally scalable with nodes and partitions
• Reliable – distributed and fault tolerant
• Durable – zero data loss, messages persisted to disk in an immutable log
• Open Source – an Apache project
• Available as a Managed Service – on multiple cloud platforms
14. “Poste Restante”?
• Not a post office in a restaurant
• Called “general delivery” in the US
• The mail is delivered to a post office, which holds it for you until you call for it
15. Consumers “poll” for messages by visiting the Poste Restante counter at the post office.
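In client code, a “visit to the counter” is the poll() call. Here’s a minimal sketch using the standard Java client (kafka-clients); the broker address is an assumption, while the topic and group names follow the talk’s party example.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PollingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "nerds");                   // consumer group (more on groups below)
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("parties"));
            while (true) {
                // poll() is the visit to the counter: it returns whatever records
                // have arrived on the partitions allocated to this consumer.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```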
17. Benefits include:
• Disconnected delivery – the consumer doesn’t need to be available to receive messages
• Less effort for the messaging service – it only has to deliver to a few locations, not to every consumer
• Can scale better and handle more complex delivery semantics!
29. Anatomy of a record (a letter addressed to “Santa, North Pole”): the Topic is the destination; the “postmark” carries the timestamp, offset, and partition; and the optional Key maps the record to a partition. Kafka Producers and Consumers need a serializer and de-serializer to write and read the key and value.
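As a sketch, here’s how that letter anatomy looks with the Java producer client: the topic, optional key, and value are supplied explicitly, and the configured serializers turn the key and value into bytes. The broker address is assumed; the topic and the Santa addressee come from the talk’s example.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LetterProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Topic (destination), key (rest of the address, optional), value (contents).
            // The "postmark" (timestamp, partition, offset) is stamped on when the
            // record is sent and appended to the log.
            producer.send(new ProducerRecord<>("parties", "Santa", "Ho ho ho"));
        }
    }
}
```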
30. Kafka doesn’t look at the value, but the Consumer can read it and try to make sense of the message. What will Santa be delivering?!
52. [Diagram: a Producer writes to Topic “Parties” (Partitions 1, 2, … n), which is read by Consumers in two Consumer Groups.]
Consumers subscribed to a topic are allocated partitions, and they will only get messages from their allocated partitions.
53. [Diagram: a Producer writes to Topic “Parties” (Partitions 1, 2, … n), read by one Consumer Group with several Consumers.]
Consumers share work within groups: consumers in the same group share the work around, so each consumer gets only a subset of messages.
54. [Diagram: a Producer writes to Topic “Parties” (Partitions 1, 2, … n), read by two Consumer Groups, each with its own Consumers.]
Messages are duplicated across consumer groups: multiple groups enable message broadcasting, and each consumer group receives a copy of each message.
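In code, the only client setting that decides “share the work” versus “get a copy” is group.id: consumers with the same group.id split the topic’s partitions between them, while each distinct group receives every message. A sketch with the Java client (assumed broker address; group and consumer names from the talk):

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class Groups {
    // Build a consumer in the given group; identical code, different group.id.
    static KafkaConsumer<String, String> consumerInGroup(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", groupId);
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(List.of("parties"));
        return consumer;
    }

    public static void main(String[] args) {
        // Bill and Paul share group "nerds": the partitions of "parties" are
        // divided between them, so each gets only a subset of the messages.
        var bill = consumerInGroup("nerds");
        var paul = consumerInGroup("nerds");
        // Chewy is alone in group "hairy": he is allocated every partition,
        // so he receives a copy of every message.
        var chewy = consumerInGroup("hairy");
    }
}
```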
55. Key: partition-based delivery. Which messages are delivered to which consumers? If a message has a key, then Kafka uses partition-based delivery: messages with the same key are always sent to the same partition, and therefore to the same consumer. And the order is guaranteed.
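This “same key, same partition” rule comes from hashing. The sketch below mirrors the Java client’s default partitioner for keyed records (a murmur2 hash of the serialized key, modulo the partition count, using the public Utils class in kafka-clients); note the mapping only stays stable while the partition count doesn’t change.

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

public class WhichPartition {
    // Hash the key and map it to one of the topic's partitions, as the
    // default partitioner does for records that have a key.
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    public static void main(String[] args) {
        // With 11 partitions (as in the walkthrough below), every "Cool pool party"
        // record lands on the same partition, so the same consumer sees both
        // the invitation and the cancellation, in order.
        System.out.println(partitionFor("Cool pool party", 11));
        System.out.println(partitionFor("Horrible Halloween party", 11));
    }
}
```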
56. No key: round-robin delivery. If the key is null, then Kafka uses round-robin delivery: each message is delivered to the next partition.
59. Consumer Group “Nerds” has multiple consumers; Consumer Group “Hairy” has a single consumer.
60. Case 1: No Key
[Diagram: the Producer sends messages M1, M2, etc. to Topic “Parties” (Partitions 1, 2, … n). Group “Nerds” has Consumer 1 (Bill), Consumer 2 (Paul), … Consumer n; Group “Hairy” has Consumer 1 (Chewy). All consumers are subscribed to “Parties”.]
With no key, each message (M1, M2, etc.) is sent round robin to the next partition, and all consumers allocated to that partition will receive the message when they next poll.
62. 1. Both groups subscribe to Topic “parties” (11 partitions, so 1 consumer per partition).
63. 2. The Producer sends the record <key=null, value=“Cool pool party - Invitation”> to the “parties” topic (no key).
64. 3. Bill and Chewbacca receive a copy of the invitation and plan to attend.
65. 4. The Producer sends another record <key=null, value=“Cool pool party - Cancelled”> to the “parties” topic.
66. 5. Paul and Chewbacca receive the cancellation. Paul gets the message this time, as it’s round robin, but ignores it because he never got the invitation; Bill wastes his time trying to go to a cancelled party. The rest of the gang aren’t surprised at not receiving any party invites and stay at home to do some hacking. Chewy is the only consumer in his group, so he gets all the messages and plans something fun instead…
68. Case 2: If there is a Key
[Diagram: the Producer sends messages M1, M2 and M3 to Topic “Parties” (Partitions 1, 2, … n). Group “Nerds” has Consumer 1 (Bill), Consumer 2 (Paul), … Consumer n; Group “Hairy” has Consumer 1 (Chewy). All consumers are subscribed to “Parties”.]
A key is hashed to a partition, and a message with that key is always sent to that partition. Assume there are 3 messages, and that messages 1 and 2 are hashed to the same partition.
69. Here’s what happens with a key. The key is the “title” of the message (e.g. “Cool pool party”). Same set-up as before:
1. Both groups subscribe to Topic “parties” (11 partitions).
70. 2. The Producer sends the record <key=“Cool pool party”, value=“Invitation”> to the “parties” topic.
71. 3. As before, Bill and Chewbacca receive a copy of the invitation and plan to attend.
72. 4. The Producer sends another record <key=“Cool pool party”, value=“Cancelled”> to the “parties” topic.
73. 5. Bill and Chewbacca receive the cancellation (the same consumers this time, as the key is identical).
74. 6. The Producer sends another record <key=“Horrible Halloween party”, value=“Invitation”> to the “parties” topic.
75. 7. Paul and Chewy receive the invitation. Paul receives the Halloween invitation because the key is different and the record is sent to the partition that Paul is allocated to. Chewy is the only consumer in his group, so he gets every record no matter what partition it’s sent to.
79. Real-time data pipeline
Real-time data pipeline features:
• Ingestion of multiple heterogeneous sources
• Sending data to multiple heterogeneous sinks
• Acts as a buffer to smooth out load spikes
• Enables use cases which reprocess data (e.g. disaster recovery)
80. Anomaly Detection Pipeline
Real-time event processing pipeline features:
• Simple event-driven applications (if X then Y…)
• May write to and read from other data sources (e.g. Cassandra)
• New events sent back to Kafka or on to other systems
• E.g. Anomaly Detection; check out my current blog series if you are interested in this example (see the sketch below)
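A minimal sketch of such an event-driven stage with the Java clients: consume an event, apply a simple “if X then Y” rule, and send a new event back to Kafka. The topic names, group id, and threshold rule are illustrative assumptions, not the actual anomaly detection application.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AnomalyChecker {
    public static void main(String[] args) {
        Properties common = new Properties();
        common.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        Properties cons = (Properties) common.clone();
        cons.put("group.id", "anomaly-checkers"); // hypothetical group name
        cons.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cons.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties prod = (Properties) common.clone();
        prod.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        prod.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cons);
             KafkaProducer<String, String> producer = new KafkaProducer<>(prod)) {
            consumer.subscribe(List.of("events"));
            while (true) {
                for (ConsumerRecord<String, String> event : consumer.poll(Duration.ofMillis(500))) {
                    // "If X then Y": flag implausibly large values as anomalies.
                    if (Double.parseDouble(event.value()) > 1000.0) {
                        producer.send(new ProducerRecord<>("anomalies", event.key(), event.value()));
                    }
                }
            }
        }
    }
}
```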
81. Kafka Streams Processing (Kongo IoT blog series)
Streams processing features:
• Complex streams processing (multiple events and streams)
• Time, windows, and transformations
• Uses the Kafka Streams API, which includes a state store
• Visualization of the streams topology
• Continuously computes the loads for trucks and checks if they are overloaded (see the sketch below)
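For flavour, here’s a minimal Kafka Streams sketch in the spirit of the truck overload check: a stateless filter from one topic to another. The real Kongo example also aggregates load per truck using the state store; the topic names, application id, and load limit here are assumptions.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class OverloadCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "truck-overload-check"); // hypothetical id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // assumed address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.StringSerde.class);
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.DoubleSerde.class);

        StreamsBuilder builder = new StreamsBuilder();
        // key = truck id, value = current load; flag trucks over an assumed limit.
        KStream<String, Double> loads = builder.stream("truck-loads");
        loads.filter((truck, load) -> load > 10_000.0)
             .to("overloaded-trucks");

        new KafkaStreams(builder.build(), props).start();
    }
}
```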
82. LinkedIn - Before Kafka (BK)
A real example from LinkedIn, who developed Kafka. Before Kafka they had spaghetti integration of monolithic applications. To accommodate growing membership and increasing site complexity, they migrated from a monolithic application infrastructure to one based on microservices, which made the integration even more complex!
83. After Kafka (AK)
Rather than maintaining and scaling each pipeline individually, they invested in the development of a single, distributed pub-sub platform - Kafka was born. The main benefit was better service decoupling and independent scaling.
84. The End (of the introduction) - Find out more
Apache Kafka: https://kafka.apache.org/
Instaclustr blogs:
• Mix of Cassandra, Spark, Zeppelin and Kafka: https://www.instaclustr.com/paul-brebner/
• Kafka introduction:
https://insidebigdata.com/2018/04/12/developing-deeper-understanding-apache-kafka-architecture/
https://insidebigdata.com/2018/04/19/developing-deeper-understanding-apache-kafka-architecture-part-2-
• Kongo – Kafka IoT logistics application blog series: https://www.instaclustr.com/instaclustr-kongo-iot-logistics-streaming-demo-application/
• Anomaly detection with Kafka and Cassandra (and Kubernetes), current blog series: https://www.instaclustr.com/anomalia-machina-1-massively-scalable-anomaly-detection-with-apache-kafka-
Instaclustr’s Managed Kafka (free trial): https://www.instaclustr.com/solutions/managed-apache-kafka/
Editor's Notes
What is Kafka? Kafka is a distributed streams processing system: it allows distributed producers to send messages to distributed consumers via a Kafka cluster.
Kafka benefits
Fast – high throughput and low latency
Scalable – horizontally scalable, just add nodes and partitions
Reliable – distributed and fault tolerant
Zero data loss – messages are persisted to disk in an immutable log
Open Source – An Apache project
Available as a Managed Service - on multiple cloud platforms
Back to the intro Kafka overview diagram, it’s a bit monochrome and boring. This talk will be more colourful and it’s going to be an extended story…
Let’s build a modern day fully electronic postal service – the Kafka Postal Service
What does a postal service do? It sends messages from A to B (animated with sounds, click!)
A is a producer, sends a message…
To B, the consumer.
Actually, no. Due to the decline in “snail mail” volumes, direct deliveries have been cancelled.
Instead we have “Poste Restante” - Not a post office in a restaurant
It’s called general delivery (in the US).
The mail is delivered to a post office, and they hold it for you until you call for it.
Consumers poll for messages by visiting the counter at the post office.
Kafka topics act like a Post Office. What are the benefits?
Disconnected delivery – consumer doesn’t need to be available to receive messages
Less effort for the messaging service – it only has to deliver to a few locations, not a larger number of consumer addresses
Can scale better and handle more complex delivery semantics!
First let's see how it scales. What if there are many consumers for a topic?
A single counter introduces delays and limits concurrency
More counters increase concurrency and can reduce delays
Kafka Topics have 1 or more Partitions, partitions function like multiple counters and enable high concurrency.
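To make the topic/partition idea concrete, here's a minimal sketch of creating a topic programmatically with Kafka's Java AdminClient (the broker address, topic name, partition count and replication factor are all example values, not anything prescribed by Kafka):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker

        try (AdminClient admin = AdminClient.create(props)) {
            // "parties" topic with 11 partitions (the extra "counters"), replicated 3 ways
            NewTopic parties = new NewTopic("parties", 11, (short) 3);
            admin.createTopics(Collections.singleton(parties)).all().get();
        }
    }
}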
Before looking at delivery semantics what does a message actually look like? In Kafka a message is called a Record and is like a letter.
The topic is the destination.
The “Postmark” includes a timestamp, the offset within the partition, and the partition it was sent to. Time semantics are flexible: either the time of event creation, ingestion, or processing.
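In the Java client that postmark is visible on every consumed record; for example (a sketch using the standard ConsumerRecord accessors):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;

public class Postmark {
    // Print the "postmark" (topic, partition, offset, timestamp) of each record
    static void printPostmarks(ConsumerRecords<String, String> records) {
        for (ConsumerRecord<String, String> record : records) {
            System.out.printf("topic=%s partition=%d offset=%d timestamp=%d%n",
                    record.topic(), record.partition(), record.offset(),
                    record.timestamp());
        }
    }
}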
There’s also a thing called a Key, which is optional.
It refines the destination so it’s a bit like the rest of the address. We want this letter sent to Santa not just any Elf.
The value is the contents (just a byte array). Kafka producers and consumers must agree on a serializer and deserializer for both the key and the value.
Kafka doesn’t look inside the value, but the Producer and Consumer can, and the Consumer can try and make sense of the message…
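As an illustration, here's how a producer might be configured with a shared String serializer (a sketch; the broker address is an assumption, and the consumer side would use the matching StringDeserializer):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

public class SerializerConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        // The producer turns keys and values into byte arrays...
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // ...and the consumer must deserialize them with the matching
        // key.deserializer / value.deserializer (StringDeserializer here)
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.close();
    }
}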
I wonder what Santa will be delivering?
Next let's look at delivery semantics. For example, do we care if the message actually arrives or not?
Yes we do! Guaranteed message delivery is desirable.
Homing pigeons got lost or eaten, so we need to send the message with multiple pigeons
How does Kafka guarantee delivery?
A Message (M1) is written to a broker (2)
and the message is always persisted to disk.
This makes it resilient to loss of power.
The message is also replicated on multiple brokers; a replication factor of 3 is typical.
This makes it resilient to the loss of some servers.
Finally, the producer gets an acknowledgement once the message is persisted and replicated (the number of acknowledgements, and sync vs. async, are configurable)
The message is now available from more than one broker in case some fail.
This also increases the read concurrency as partitions are spread over multiple brokers.
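In the Java producer this behaviour is controlled by the acks setting, and a send can be made synchronous or asynchronous depending on how the returned future is used. A minimal sketch (broker address and topic are the example values used in this talk):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AckedProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // "all" waits for every in-sync replica; "1" for the leader only; "0" for none
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("parties", "Cool pool party - Invitation");
            // Synchronous: block until the broker acknowledges
            producer.send(record).get();
            // Asynchronous: receive the acknowledgement via a callback
            producer.send(record, (metadata, exception) -> {
                if (exception != null) exception.printStackTrace();
            });
        }
    }
}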
Now let’s look at the 2nd aspect of delivery semantics.
Who gets the messages and how many times are messages delivered?
Kafka is “pub-sub”: it’s loosely coupled, and producers and consumers don’t know about each other
Filtering, or which consumers get which messages, is topic based
Publishers send messages to topics
Consumers subscribe to topics of interest, e.g. parties. When they poll they only receive messages sent to those topics.
Just a few more details and we can see how this works. Partitions and consumer groups enable sharing of work across multiple consumers; the more partitions a topic has, the more consumers it supports.
Kafka supports delivery of the same message to multiple consumers. Kafka doesn’t throw messages away immediately after they are delivered, so the same message can easily be delivered to multiple consumer groups.
Consumers subscribed to the ”parties” topic are allocated partitions. When they poll they will only get messages from their allocated partitions.
This enables consumers in the same group to share the work around. Each consumer gets only a subset of the available messages.
Multiple groups enable message broadcasting. Messages are duplicated across groups, as each consumer group receives a copy of each message.
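A minimal consumer sketch tying these ideas together: the group.id determines work-sharing within a group, and a second group with a different id would receive its own copy of every message (broker address and group names are just the example values from this talk):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PartyConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        // Consumers sharing this group.id split the topic's partitions between them
        props.put("group.id", "nerds");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("parties"));
            while (true) {
                // Poll returns records only from this consumer's allocated partitions
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records)
                    System.out.printf("%s: %s%n", record.key(), record.value());
            }
        }
    }
}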
Which messages are delivered to which consumers? The final aspect of delivery semantics is to do with message keys.
If a message has a key, then Kafka uses Partition based delivery.
Messages with the same key are always sent to the same partition and therefore the same consumer.
And the order is guaranteed.
If the key is null, then Kafka uses round robin delivery. Each message is delivered to the next partition.
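Conceptually the partition choice works like the sketch below. This is not Kafka's actual implementation (the real default partitioner hashes keys with murmur2, and clients since Kafka 2.4 use a "sticky" strategy for null keys rather than strict round robin), but it shows the key property: equal keys always map to the same partition.

import java.util.concurrent.atomic.AtomicInteger;

public class SketchPartitioner {
    private final AtomicInteger counter = new AtomicInteger();

    // Pick a partition for a record (simplified illustration only)
    int partitionFor(String key, int numPartitions) {
        if (key == null) {
            // No key: round robin across the partitions
            return (counter.getAndIncrement() & 0x7fffffff) % numPartitions;
        }
        // Key present: hash it, so the same key always lands on the same partition
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}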
Let’s look at an example with 2 consumer groups. Nerds, which has multiple consumers, and Hairy which has a single consumer.
Looking at the no-key case first: each message (1, 2, etc.) is sent to the next partition, and all consumers allocated to that partition will receive the message when they next poll.
Here’s what actually happens. We’re not showing the producer or topic for simplicity. You’ll have to imagine them.
Both groups subscribe to topic “parties” (11 partitions, so 1 consumer per partition).
Producer sends a record with the value “Cool pool party - Invitation” to the “parties” topic (there’s no key).
Bill and Chewbacca receive a copy of the invitation and plan to attend.
Producer sends another record with the value “Cool pool party – Cancelled” to the “parties” topic.
In the Nerds group, Paul gets the message this time as it’s round robin, and Chewy gets it as he’s the only consumer in his group.
Paul ignores it as he didn’t get the original invite
Bill wastes his time trying to go
The rest of the gang aren’t surprised at not receiving any invites and stay home to do some hacking
Chewy plans something else fun instead...
A visit to the hairdressers!
How does it work if there is a key? The key is hashed to a partition, and the message is sent to that partition.
Assume there are 3 messages, and messages 1 and 2 are hashed to the same partition.
Here’s what happens with a key, assuming that the key is the “title” of the message (“Cool pool party”).
As before, both groups subscribe to topic “parties” (11 partitions).
Producer sends record <key=“Cool pool party”, value=“Invitation”> to the “parties” topic (the sends are shown in the producer sketch after this walkthrough).
As before, Bill and Chewbacca receive a copy of the invitation and plan to attend
Producer sends another record with the same key but a cancellation value to the “parties” topic.
This time, Bill and Chewbacca receive the cancellation (the same consumers as the key is identical)
Producer sends out another invitation to a Halloween party. The key is different this time.
Paul receives the Halloween invitation as the key is different and the record is sent to the partition that Paul is allocated
Chewy is the only consumer in his group so he gets every record no matter what partition it’s sent to
This time Chewy gets dressed up and goes to the party.
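In the Java producer API, the sends in this walkthrough look like the sketch below (assuming a producer configured as in the earlier acknowledgement sketch; topic and key strings are from the example):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PartySends {
    // Reuses a producer configured as in the earlier sketch
    static void sendInvitations(KafkaProducer<String, String> producer) {
        // Same key => same partition => the same consumers (Bill and Chewbacca)
        producer.send(new ProducerRecord<>("parties", "Cool pool party", "Invitation"));
        producer.send(new ProducerRecord<>("parties", "Cool pool party", "Cancelled"));
        // A different key will (very likely) hash to a different partition,
        // and therefore to different consumers within the Nerds group
        producer.send(new ProducerRecord<>("parties", "Halloween party", "Invitation"));
    }
}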
Who did you imagine was producing the invitations? Maybe this fellow.
Here’s a UML-like diagram with the main Kafka components. We’ve introduced Producers, Topics, Partitions, Consumer Groups and Consumers today.
There’s a lot more to explore, including how Kafka provides replication, and the Connect and Streaming APIs.
To finish up, here are three important use cases for Kafka.
Real-time data pipeline features:
Ingestion of multiple heterogeneous sources
Sending data to multiple heterogeneous sinks
Acts as a buffer to smooth out load spikes
Enables use cases which reprocess data (e.g. disaster recovery)
Real-time Event processing features:
Simple event driven applications (If X then Y…)
May write and read from other data sources (e.g. Cassandra)
New Events sent back to Kafka or to other systems
E.g. Anomaly Detection, check out my current blog series if you are interested in this example.
Streams processing features:
Complex streams processing (multiple events and streams)
Time, windows, and transformations
Streams API, includes state store
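As a taste of the Streams API, here's a minimal sketch of a topology that filters the “parties” stream (illustrative only; the application id and topic names are assumptions based on the examples above, not taken from the Kongo application):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class InvitationFilter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "invitation-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG,
                Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG,
                Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Simple event-driven rule (if X then Y): forward only invitations
        KStream<String, String> parties = builder.stream("parties");
        parties.filter((key, value) -> value.contains("Invitation"))
               .to("invitations");

        new KafkaStreams(builder.build(), props).start();
    }
}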
This is a visualization of the streams topology from the stream processing pipeline in my previous blog series, Kongo, a Kafka IoT logistics application.
It continuously computes the loads for trucks and checks if they are overloaded.
Here’s a real example from LinkedIn, where Kafka was developed. Before Kafka they had spaghetti integration of monolithic applications.
To accommodate growing membership and increasing site complexity, they migrated from a monolithic application infrastructure to one based on microservices, which made the integration even more complex.
After Kafka
Rather than maintaining and scaling each pipeline individually, they invested in the development of a single, distributed pub-sub platform. Thus, Kafka was born. The main benefit was better service decoupling and independent scaling.