Slack processes over 1.2 trillion messages written and 3.4 trillion messages read daily across its real-time messaging platform, generating around 1 petabyte of streaming data. With thousands of engineers and tens of thousands of producer processes, Slack relies on Apache Kafka as the commit log for its distributed database to handle its massive scale of real-time messaging.
Producer Performance Tuning for Apache Kafka | Jiangjie Qin
Kafka is well known for high throughput ingestion. However, to get the best latency characteristics without compromising on throughput and durability, we need to tune Kafka. In this talk, we share our experiences to achieve the optimal combination of latency, throughput and durability for different scenarios.
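To make the latency, throughput and durability trade-off concrete, here is a minimal sketch of two producer profiles. The keys (`acks`, `linger.ms`, `batch.size`, `compression.type`, `enable.idempotence`) are standard Kafka producer configuration names, but the values and the `describe` helper are assumptions chosen for illustration and are not taken from the talk.

```python
# Illustrative producer settings. The keys are real Kafka producer
# configuration names; every value is an assumption to be tuned per workload.
LOW_LATENCY = {
    "acks": "1",                 # wait for the partition leader only
    "linger.ms": 0,              # send as soon as a record is available
    "batch.size": 16384,         # maximum bytes per batch
    "compression.type": "none",
}

HIGH_THROUGHPUT_DURABLE = {
    "acks": "all",               # wait for all in-sync replicas
    "linger.ms": 20,             # wait briefly so batches can fill up
    "batch.size": 131072,
    "compression.type": "lz4",
    "enable.idempotence": True,  # avoid duplicates on producer retries
}


def describe(profile):
    """Summarize the trade-off a profile makes (hypothetical helper)."""
    durability = "leader and replicas" if profile["acks"] == "all" else "leader only"
    batching = "batched" if profile["linger.ms"] > 0 else "immediate"
    return f"{batching} sends, acknowledged by {durability}"


print(describe(LOW_LATENCY))
print(describe(HIGH_THROUGHPUT_DURABLE))
```

Larger batches and a non-zero linger time generally raise throughput at the cost of per-record latency, while `acks=all` trades latency for durability.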
This talk gives a brief introduction to Apache Kafka and describes its usage as a platform for streaming data. It introduces some of the newer components of Kafka that help make this possible, including Kafka Connect, a framework for capturing continuous data streams, and Kafka Streams, a lightweight stream processing library.
Apache Kafka is a distributed publish-subscribe messaging system that can handle high volumes of data and pass messages from one endpoint to another. It uses a distributed commit log that persists messages on disk for durability. Kafka is fast, scalable, and fault-tolerant, and with replication and suitable producer settings it can be configured to avoid data loss. It is used by companies like LinkedIn, Twitter, and Netflix to handle high volumes of real-time data and streaming workloads.
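The commit log idea can be sketched in a few lines. This is a toy in-memory model, not Kafka's storage engine: the `CommitLog` class and its methods are hypothetical names, and a real broker persists records to disk segments and replicates them.

```python
class CommitLog:
    """Toy in-memory append-only log. A real broker persists records on disk."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return the offset it was assigned."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at offset."""
        return self._records[offset:offset + max_records]


log = CommitLog()
for event in ["signup", "login", "purchase"]:
    log.append(event)

# Readers choose their own starting offset, and reading never removes
# records, so several consumers can replay the same log independently.
print(log.read(0))
print(log.read(2))
```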
Apache Kafka 0.8 basic training - Verisign | Michael Noll
Apache Kafka 0.8 basic training (120 slides) covering:
1. Introducing Kafka: history, Kafka at LinkedIn, Kafka adoption in the industry, why Kafka
2. Kafka core concepts: topics, partitions, replicas, producers, consumers, brokers
3. Operating Kafka: architecture, hardware specs, deploying, monitoring, P&S tuning
4. Developing Kafka apps: writing to Kafka, reading from Kafka, testing, serialization, compression, example apps
5. Playing with Kafka using Wirbelsturm
Audience: developers, operations, architects
Created by Michael G. Noll, Data Architect, Verisign, https://www.verisigninc.com/
Verisign is a global leader in domain names and internet security.
Tools mentioned:
- Wirbelsturm (https://github.com/miguno/wirbelsturm)
- kafka-storm-starter (https://github.com/miguno/kafka-storm-starter)
Blog post at:
http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/
Many thanks to the LinkedIn Engineering team (the creators of Kafka) and the Apache Kafka open source community!
Kafka Streams: What it is, and how to use it? | confluent
Kafka Streams is a client library for building distributed applications that process streaming data stored in Apache Kafka. It provides a high-level streams DSL that allows developers to express streaming applications as a set of processing steps. Alternatively, developers can use the lower-level processor API to implement custom business logic. Kafka Streams handles tasks like fault tolerance, scalability and state management. It represents data as streams for unbounded data or tables for bounded state. Common operations include transformations, aggregations, joins and table operations.
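The relationship between a stream and a table can be illustrated without the library. The sketch below is plain Python, not the Kafka Streams API (which is a Java library): `count_by_key` is a hypothetical function that folds a stream of keyed records into a table of counts and emits every table update as a changelog.

```python
from collections import defaultdict


def count_by_key(stream):
    """Aggregate a stream of (key, value) records into per-key counts.

    Yields (key, new_count) after every record, which is the changelog of
    the table being built.
    """
    table = defaultdict(int)
    for key, _value in stream:
        table[key] += 1
        yield key, table[key]


page_views = [("alice", "/home"), ("bob", "/cart"), ("alice", "/checkout")]

# A stateless transformation (filter) followed by a stateful aggregation.
non_home_views = (record for record in page_views if record[1] != "/home")
changelog = list(count_by_key(non_home_views))

# Replaying the changelog and keeping the latest value per key gives the table.
table = dict(changelog)
print(changelog)
print(table)
```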
Architecture patterns for distributed, hybrid, edge and global Apache Kafka d... | Kai Wähner
Architecture patterns for distributed, hybrid, edge and global Apache Kafka deployments
Multi-cluster and cross-data center deployments of Apache Kafka have become the norm rather than an exception. This session gives an overview of several scenarios that may require multi-cluster solutions and discusses real-world examples with their specific requirements and trade-offs, including disaster recovery, aggregation for analytics, cloud migration, mission-critical stretched deployments and global Kafka.
Key takeaways:
In many scenarios, one Kafka cluster is not enough. Understand different architectures and alternatives for multi-cluster deployments.
Zero data loss and high availability are two key requirements. Understand how to realize this, including trade-offs.
Learn about the features and limitations of Kafka for multi-cluster deployments.
Global Kafka and mission-critical multi-cluster deployments with zero data loss and high availability have become the norm, not the exception.
Apache Kafka is an open-source message broker project, written in Scala and Java and developed by the Apache Software Foundation. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
Apache Kafka is a distributed streaming platform used for building real-time data pipelines and streaming apps. It provides a unified, scalable, and durable platform for handling real-time data feeds. Kafka works by accepting streams of records from one or more producers and organizing them into topics. It allows both storing and forwarding of these streams to consumers. Producers write data to topics, whose partitions are replicated across the brokers of a cluster for fault tolerance. Consumers can then read the data in the order it was produced within each partition. Major companies like LinkedIn, Yahoo, Twitter, and Netflix use Kafka for applications like metrics, logging, stream processing and more.
Kafka Streams is a new stream processing library natively integrated with Kafka. It has a very low barrier to entry, easy operationalization, and a natural DSL for writing stream processing applications. As such it is the most convenient yet scalable option to analyze, transform, or otherwise process data that is backed by Kafka. We will provide the audience with an overview of Kafka Streams including its design and API, typical use cases, code examples, and an outlook of its upcoming roadmap. We will also compare Kafka Streams' light-weight library approach with heavier, framework-based tools such as Spark Streaming or Storm, which require you to understand and operate a whole different infrastructure for processing real-time data in Kafka.
Kafka Streams State Stores Being Persistent | confluent
This document discusses Kafka Streams state stores. It provides examples of using different types of windowing (tumbling, hopping, sliding, session) with state stores. It also covers configuring state store logging, caching, and retention policies. The document demonstrates how to define windowed state stores in Kafka Streams applications and discusses concepts like grace periods.
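As a rough illustration of how tumbling and hopping windows differ, the following sketch computes which windows a record timestamp falls into. It is plain Python arithmetic rather than Kafka Streams code; the function names are hypothetical, and it assumes windows are aligned to the epoch and are half-open intervals `[start, end)`.

```python
def tumbling_window(timestamp_ms, size_ms):
    """Return the single (start, end) window that contains the timestamp."""
    start = timestamp_ms - timestamp_ms % size_ms
    return (start, start + size_ms)


def hopping_windows(timestamp_ms, size_ms, advance_ms):
    """Return every (start, end) window that contains the timestamp.

    Windows start at multiples of advance_ms and overlap whenever
    advance_ms is smaller than size_ms.
    """
    windows = []
    start = timestamp_ms - timestamp_ms % advance_ms
    while start >= 0 and start + size_ms > timestamp_ms:
        windows.append((start, start + size_ms))
        start -= advance_ms
    return sorted(windows)


# A record at t=7 lands in one tumbling window of size 5,
# but in two hopping windows of size 10 that advance by 5.
print(tumbling_window(7, 5))
print(hopping_windows(7, 10, 5))
```

A windowed state store would keep one aggregate per key and window; the grace period then decides how long a window keeps accepting late records after its end.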
Kafka is an open source messaging system that can handle massive streams of data in real-time. It is fast, scalable, durable, and fault-tolerant. Kafka is commonly used for stream processing, website activity tracking, metrics collection, and log aggregation. It supports high throughput, reliable delivery, and horizontal scalability. Some examples of real-time use cases for Kafka include website monitoring, network monitoring, fraud detection, and IoT applications.
Kafka's basic terminologies, its architecture, its protocol and how it works.
Kafka at scale, its caveats, guarantees and use cases offered by it.
How we use it @ZaprMediaLabs.
This document provides an overview of Apache Kafka. It begins with defining Kafka as a distributed streaming platform and messaging system. It then lists the agenda which includes what Kafka is, why it is used, common use cases, major companies that use it, how it achieves high performance, and core concepts. Core concepts explained include topics, partitions, brokers, replication, leaders, and producers and consumers. The document also provides examples to illustrate these concepts.
Tradeoffs in Distributed Systems Design: Is Kafka The Best? (Ben Stopford and... | HostedbyConfluent
When choosing an event streaming platform, Kafka shouldn’t be the only technology you look at. There are a plethora of others in the messaging space today, including open source and proprietary software as well as a range of cloud services. So how do you know you are choosing the right one? A great way to deepen our understanding of event streaming and Kafka is exploring the trade-offs in distributed system design and learning about the choices made by the Kafka project. We’ll look at how Kafka stacks up against other technologies in the space, including traditional messaging systems like Apache ActiveMQ and RabbitMQ as well as more contemporary ones, such as BookKeeper derivatives like Apache Pulsar or Pravega. This talk focuses on technical details such as differences in messaging models, how data is stored locally as well as across machines in a cluster, when (not) to add tiers to your system, and more. By the end of the talk, you should have a good high-level understanding of how these systems compare and which you should choose for different types of use cases.
Integrating Apache Kafka Into Your Environment | confluent
Watch this talk here: https://www.confluent.io/online-talks/integrating-apache-kafka-into-your-environment-on-demand
Integrating Apache Kafka with other systems in a reliable and scalable way is a key part of an event streaming platform. This session will show you how to get streams of data into and out of Kafka with Kafka Connect and REST Proxy, maintain data formats and ensure compatibility with Schema Registry and Avro, and build real-time stream processing applications with Confluent KSQL and Kafka Streams.
This session is part 4 of 4 in our Fundamentals for Apache Kafka series.
The document provides an introduction and overview of Apache Kafka presented by Jeff Holoman. It begins with an agenda and background on the presenter. It then covers basic Kafka concepts like topics, partitions, producers, consumers and consumer groups. It discusses efficiency and delivery guarantees. Finally, it presents some use cases for Kafka and positioning around when it may or may not be a good fit compared to other technologies.
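The consumer group idea, where each partition is read by exactly one member of the group, can be sketched as follows. This is a simplified round-robin assignment written for illustration; `assign_partitions` is a hypothetical function, and Kafka's built-in assignors differ in detail.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin.

    Each partition goes to exactly one consumer, so adding consumers
    (up to the number of partitions) spreads the load.
    """
    members = sorted(consumers)
    assignment = {member: [] for member in members}
    for index, partition in enumerate(sorted(partitions)):
        assignment[members[index % len(members)]].append(partition)
    return assignment


print(assign_partitions(range(4), ["c1", "c2"]))

# With more consumers than partitions, the extra consumer stays idle.
print(assign_partitions(range(2), ["c1", "c2", "c3"]))
```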
Apache Kafka is the de facto standard for data streaming to process data in motion. With its significant adoption growth across all industries, I get a very valid question every week: When NOT to use Apache Kafka? What limitations does the event streaming platform have? When does Kafka simply not provide the needed capabilities? How do you qualify Kafka out when it is not the right tool for the job?
This session explores the DOs and DONTs. Separate sections explain when to use Kafka, when NOT to use Kafka, and when to MAYBE use Kafka.
No matter if you think about open source Apache Kafka, a cloud service like Confluent Cloud, or another technology using the Kafka protocol like Redpanda or Pulsar, check out this slide deck.
A detailed article about this topic:
https://www.kai-waehner.de/blog/2022/01/04/when-not-to-use-apache-kafka/
Spring Boot+Kafka: the New Enterprise Platform | VMware Tanzu
This document discusses how Spring Boot and Kafka can form the basis of a new enterprise application platform focused on continuous delivery, event-driven architectures, and streaming data. It provides examples of companies that have successfully adopted this approach, such as Netflix transitioning to Spring Boot and a banking brand building a new core banking system using Spring Streams and Kafka. The document advocates an "event-first" and microservices-oriented mindset enabled by a streaming data platform and suggests that Spring Boot, Kafka, and related technologies provide a turnkey solution for implementing this new application development approach at large enterprises.
Watch this talk here: https://www.confluent.io/online-talks/apache-kafka-architecture-and-fundamentals-explained-on-demand
This session explains Apache Kafka’s internal design and architecture. Companies like LinkedIn are now sending more than 1 trillion messages per day to Apache Kafka. Learn about the underlying design in Kafka that leads to such high throughput.
This talk provides a comprehensive overview of Kafka architecture and internal functions, including:
-Topics, partitions and segments
-The commit log and streams
-Brokers and broker replication
-Producer basics
-Consumers, consumer groups and offsets
This session is part 2 of 4 in our Fundamentals for Apache Kafka series.
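How consumers use offsets can be sketched with a toy model. The class below is hypothetical and is not the Kafka client API; it assumes a single partition held in a Python list and shows why records read after the last commit are delivered again following a restart.

```python
class OffsetTrackingConsumer:
    """Toy consumer for one partition that tracks position and committed offset."""

    def __init__(self, log, committed=0):
        self.log = log
        self.committed = committed   # offset to resume from after a restart
        self.position = committed    # next offset to read

    def poll(self, max_records):
        """Return the next records and advance the read position."""
        records = self.log[self.position:self.position + max_records]
        self.position += len(records)
        return records

    def commit(self):
        """Mark everything read so far as processed."""
        self.committed = self.position


partition = ["a", "b", "c", "d", "e"]

consumer = OffsetTrackingConsumer(partition)
consumer.poll(2)          # reads "a", "b"
consumer.commit()         # committed offset is now 2
consumer.poll(2)          # reads "c", "d" but "crashes" before committing

# A replacement consumer resumes from the committed offset, so "c" and "d"
# are delivered again: at-least-once delivery.
restarted = OffsetTrackingConsumer(partition, committed=consumer.committed)
print(restarted.poll(2))
```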
Kafka Tutorial - Introduction to Apache Kafka (Part 1) | Jean-Paul Azar
Why is Kafka so fast? Why is Kafka so popular? Why Kafka? This slide deck is a tutorial for the Kafka streaming platform. This slide deck covers Kafka Architecture with some small examples from the command line. Then we expand on this with a multi-server example to demonstrate failover of brokers as well as consumers. Then it goes through some simple Java client examples for a Kafka Producer and a Kafka Consumer. We have also expanded on the Kafka design section and added references. The tutorial covers Avro and the Schema Registry as well as advanced Kafka Producers.
Kafka and Confluent are nice, but what about integration with public clouds like Azure? Or, even better, integrating Kafka and Confluent with a managed API management service like Azure API Gateway?
In this talk I will show how to integrate an event streaming platform like Confluent with enterprise API management and other services to build a lambda-based data platform architecture.
Getting Started with Confluent Schema Registry | confluent
Getting started with Confluent Schema Registry, Patrick Druley, Senior Solutions Engineer, Confluent
Meetup link: https://www.meetup.com/Cleveland-Kafka/events/272787313/
Introduction to Apache Kafka and Confluent... and why they matter | confluent
Milano Apache Kafka Meetup by Confluent (First Italian Kafka Meetup) on Wednesday, November 29th 2017.
The talk introduces Apache Kafka (including the Kafka Connect and Kafka Streams APIs) and Confluent (the company founded by the creators of Kafka), and explains why Kafka is an excellent and simple solution for managing data streams in the context of two of the main driving forces and industry trends: the Internet of Things (IoT) and microservices.
Kafka is a distributed messaging system that allows for publishing and subscribing to streams of records, organized into topics. Producers write data to topics and consumers read from topics. The data is partitioned and replicated across a cluster of machines called brokers for reliability and scalability. A common data format like Avro can be used to serialize the data.
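Partitioning by key can be sketched as a hash of the record key modulo the number of partitions. The example below uses CRC32 from the Python standard library purely for illustration; Kafka's default partitioner uses a different hash function, and `partition_for` is a hypothetical name.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition index in [0, num_partitions).

    CRC32 is an illustrative stand-in, not the hash Kafka itself uses.
    """
    return zlib.crc32(key) % num_partitions


# Records with the same key always map to the same partition, which is
# what preserves per-key ordering.
for key in [b"user-1", b"user-2", b"user-1"]:
    print(key, partition_for(key, 6))
```

Changing the number of partitions changes the mapping, so keys may move to a different partition after a topic is expanded.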
This document provides an introduction to Apache Kafka. It describes Kafka as a distributed messaging system with features like durability, scalability, publish-subscribe capabilities, and ordering. It discusses key Kafka concepts like producers, consumers, topics, partitions and brokers. It also summarizes use cases for Kafka and how to implement producers and consumers in code. Finally, it briefly outlines related tools like Kafka Connect and Kafka Streams that build upon the Kafka platform.
Apache Kafka is an open-source distributed event streaming platform used for building real-time data pipelines and streaming apps. It was developed by LinkedIn in 2011 to solve problems with data integration and processing. Kafka uses a publish-subscribe messaging model and is designed to be fast, scalable, and durable. It allows both streaming and storage of data and acts as a central data backbone for large organizations.
ksqlDB: A Stream-Relational Database System | confluent
Speaker: Matthias J. Sax, Software Engineer, Confluent
ksqlDB is a distributed event streaming database system that allows users to express SQL queries over relational tables and event streams. The project was released by Confluent in 2017 and is hosted on GitHub and developed with an open-source spirit. ksqlDB is built on top of Apache Kafka®, a distributed event streaming platform. In this talk, we discuss ksqlDB’s architecture that is influenced by Apache Kafka and its stream processing library, Kafka Streams. We explain how ksqlDB executes continuous queries while achieving fault tolerance and high availability. Furthermore, we explore ksqlDB’s streaming SQL dialect and the different types of supported queries.
Matthias J. Sax is a software engineer at Confluent working on ksqlDB. He mainly contributes to Kafka Streams, Apache Kafka's stream processing library, which serves as ksqlDB's execution engine. Furthermore, he helps evolve ksqlDB's "streaming SQL" language. In the past, Matthias also contributed to Apache Flink and Apache Storm and he is an Apache committer and PMC member. Matthias holds a Ph.D. from Humboldt University of Berlin, where he studied distributed data stream processing systems.
https://db.cs.cmu.edu/events/quarantine-db-talk-2020-confluent-ksqldb-a-stream-relational-database-system/
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
In our exclusive webinar, you'll learn why event-driven architecture is the key to unlocking cost efficiency, operational effectiveness, and profitability. Gain insights on how this approach differs from API-driven methods and why it's essential for your organization's success.
Santander Stream Processing with Apache Flinkconfluent
Flink is becoming the de facto standard for stream processing due to its scalability, performance, fault tolerance, and language flexibility. It supports stream processing, batch processing, and analytics through one unified system. Developers choose Flink for its robust feature set and ability to handle stream processing workloads at large scales efficiently.
Unlocking the Power of IoT: A comprehensive approach to real-time insightsconfluent
In today's data-driven world, the Internet of Things (IoT) is revolutionizing industries and unlocking new possibilities. Join Data Reply, Confluent, and Imply as we unveil a comprehensive solution for IoT that harnesses the power of real-time insights.
Workshop híbrido: Stream Processing con Flinkconfluent
El Stream processing es un requisito previo de la pila de data streaming, que impulsa aplicaciones y pipelines en tiempo real.
Permite una mayor portabilidad de datos, una utilización optimizada de recursos y una mejor experiencia del cliente al procesar flujos de datos en tiempo real.
En nuestro taller práctico híbrido, aprenderás cómo filtrar, unir y enriquecer fácilmente datos en tiempo real dentro de Confluent Cloud utilizando nuestro servicio Flink sin servidor.
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...confluent
Our talk will explore the transformative impact of integrating Confluent, HiveMQ, and SparkPlug in Industry 4.0, emphasizing the creation of a Unified Namespace.
In addition to the creation of a Unified Namespace, our webinar will also delve into Stream Governance and Scaling, highlighting how these aspects are crucial for managing complex data flows and ensuring robust, scalable IIoT-Platforms.
You will learn how to ensure data accuracy and reliability, expand your data processing capabilities, and optimize your data management processes.
Don't miss out on this opportunity to learn from industry experts and take your business to the next level.
La arquitectura impulsada por eventos (EDA) será el corazón del ecosistema de MAPFRE. Para seguir siendo competitivas, las empresas de hoy dependen cada vez más del análisis de datos en tiempo real, lo que les permite obtener información y tiempos de respuesta más rápidos. Los negocios con datos en tiempo real consisten en tomar conciencia de la situación, detectar y responder a lo que está sucediendo en el mundo ahora.
Eventos y Microservicios - Santander TechTalkconfluent
Durante esta sesión examinaremos cómo el mundo de los eventos y los microservicios se complementan y mejoran explorando cómo los patrones basados en eventos nos permiten descomponer monolitos de manera escalable, resiliente y desacoplada.
Q&A with Confluent Experts: Navigating Networking in Confluent Cloudconfluent
This document discusses networking options and best practices for Confluent Cloud. It provides an overview of public endpoints, private link, and peering options. It then discusses best practices for private networking architectures on Azure using hub-and-spoke and private link designs. Finally, it addresses networking considerations and challenges for Kafka Connect managed connectors, as well as planned enhancements for DNS peering and outbound private link support.
Purpose of the session is to have a dive into Apache, Kafka, Data Streaming and Kafka in the cloud
- Dive into Apache Kafka
- Data Streaming
- Kafka in the cloud
Build real-time streaming data pipelines to AWS with Confluentconfluent
Traditional data pipelines often face scalability issues and challenges related to cost, their monolithic design, and reliance on batch data processing. They also typically operate under the premise that all data needs to be stored in a single centralized data source before it's put to practical use. Confluent Cloud on Amazon Web Services (AWS) provides a fully managed cloud-native platform that helps you simplify the way you build real-time data flows using streaming data pipelines and Apache Kafka.
Q&A with Confluent Professional Services: Confluent Service Meshconfluent
No matter whether you are migrating your Kafka cluster to Confluent Cloud, running a cloud-hybrid environment or are in a different situation where data protection and encryption of sensitive information is required, Confluent Service Mesh allows you to transparently encrypt your data without the need to make code changes to you existing applications.
Citi Tech Talk: Event Driven Kafka Microservicesconfluent
Microservices have become a dominant architectural paradigm for building systems in the enterprise, but they are not without their tradeoffs. Learn how to build event-driven microservices with Apache Kafka
Confluent & GSI Webinars series - Session 3confluent
An in depth look at how Confluent is being used in the financial services industry. Gain an understanding of how organisations are utilising data in motion to solve common problems and gain benefits from their real time data capabilities.
It will look more deeply into some specific use cases and show how Confluent technology is used to manage costs and mitigate risks.
This session is aimed at Solutions Architects, Sales Engineers and Pre Sales, and also the more technically minded business aligned people. Whilst this is not a deeply technical session, a level of knowledge around Kafka would be helpful.
This document discusses moving to an event-driven architecture using Confluent. It begins by outlining some of the limitations of traditional messaging middleware approaches. Confluent provides benefits like stream processing, persistence, scalability and reliability while avoiding issues like lack of structure, slow consumers, and technical debt. The document then discusses how Confluent can help modernize architectures, enable new real-time use cases, and reduce costs through migration. It provides examples of how companies like Advance Auto Parts and Nord/LB have benefitted from implementing Confluent platforms.
This session will show why the old paradigm does not work and that a new approach to the data strategy needs to be taken. It aims to show how a Data Streaming Platform is integral to the evolution of a company’s data strategy and how Confluent is not just an integration layer but the central nervous system for an organisation
Vous apprendrez également à :
• Créer plus rapidement des produits et fonctionnalités à l’aide d’une suite complète de connecteurs et d’outils de gestion des flux, et à connecter vos environnements à des pipelines de données
• Protéger vos données et charges de travail les plus critiques grâce à des garanties intégrées en matière de sécurité, de gouvernance et de résilience
• Déployer Kafka à grande échelle en quelques minutes tout en réduisant les coûts et la charge opérationnelle associés
Confluent Partner Tech Talk with Synthesisconfluent
A discussion on the arduous planning process, and deep dive into the design/architectural decisions.
Learn more about the networking, RBAC strategies, the automation, and the deployment plan.
Introduction To Streaming Data and Stream Processing with Apache Kafka
73. • Everything in the company is a real-time stream
• > 1.2 trillion messages written per day
• > 3.4 trillion messages read per day
• ~ 1 PB of stream data
• Thousands of engineers
• Tens of thousands of producer processes
• Used as commit log for distributed database
77. Coming Up Next
Date | Title | Speaker
10/6 | Deep Dive Into Apache Kafka | Jun Rao
10/27 | Data Integration with Kafka | Gwen Shapira
11/17 | Demystifying Stream Processing | Neha Narkhede
12/1 | A Practical Guide To Selecting A Stream Processing Technology | Michael Noll
12/15 | Streaming in Practice: Putting Apache Kafka in Production | Roger Hoover
Editor's Notes
Hi, I’m Jay Kreps, I’m one of the creators of Apache Kafka and also one of the co-founders of Confluent, the company driving Kafka development as well as developing Confluent Platform, the leading Kafka distribution.
Welcome to our Apache Kafka Online Talk Series.
This first talk is going to introduce Kafka and the problems it was built to solve. This is a series of talks meant to help introduce you to the world of Apache Kafka and stream processing. Along the way I’ll give pointers to areas we are going to dive into in more depth in upcoming talks.
Rather than starting off by diving into a bunch of Kafka features let me instead introduce the problem area. So what is the problem we have today that needs a new thing?
To show that, let me start by just laying out the architecture for most companies.
Most applications are request/response (client/server)
HTTP services
OLTP databases
Key/value stores
You send a request and they send back a response. These do little bits of work quickly. UI rendering is inherently this way: the client sends a request to fetch the data to display the UI.
Inherently synchronous—can’t display the UI until you get back the response with the data.
The second big area is batch processing.
This is the domain of the data warehouse and Hadoop clusters.
Cron jobs.
These are usually once a day things, though you can potentially run them a little quicker.
So this is the architecture we have today. What are the problems?
How does data get around?
Database data, log data
Lots of systems: databases, specialized systems like search, caches
Business units
N^2 connections
Tons of glue code to stitch it all together
Request/response is inherently synchronous.
Hard to scale.
Either big apps with huge amounts of work per request, or lots of little microservices…still all that work is synchronous.
Has to be synchronous: say you make an HTTP request but don’t wait for the response, then you don’t know if it actually happened or not.
Example: retail
Sales are synchronous—you give me money and I give you a product (or commit to ship you a product) and give you a receipt or confirmation number.
But a lot of the backend isn’t synchronous—I need to process shipments of new products, adjust prices, do inventory adjustments, re-order products, do things like analytics.
Most of these don’t make sense to do in the process of a single sale—they are asynchronous. If something gets borked in my inventory reordering process I don’t want to block sales.
These are the two problems that data streams can solve:
Data pipeline sprawl
Asynchronous services
This is what that architecture looks like relying on streaming.
Data pipelines go to the streaming platform, no longer N^2 separate pipelines.
Async apps can feed off of this as well.
Obviously that streaming box is going to be filled by Kafka.
Now let’s dive into these two areas.
Companies are real-time not batch
Event = something that happened
Record
A product was viewed, a sale occurred, a database was updated, etc
It’s a piece of data, a fact. But can also be a trigger or command (a sale occurred, so now let’s reorder).
Not specific to a particular system or service, just a fact.
Let’s look at a few concrete examples to get a feel for it, first some simple ones then something a bit more complex.
Event is “a web page was viewed” or “an error occurred” or whatever you’re logging.
In fact the “log file” is totally incidental to the data being recorded: the data in the log is clearly a sequence of events.
Sensors can also be represented as event streams. The event is something like “the value of this sensor is X”
This covers a lot of instrumentation of the world: IoT use cases, logistics and vehicle positions, or even taking readings of metrics from monitoring counters or gauges in your apps. All these sensors can be captured into a stream of events.
Okay, those were the easy and obvious ones, now let’s look at something more surprising.
Databases can be thought of as streams of events!
This isn’t obvious, but it’s really important because most valuable data is stored in databases.
What do I mean that you can think of a database as a stream of events?
Table/Stream duality.
Well, what’s the most common data representation in a database?
It’s a table.
A table looks something like this, a rectangle with columns, right?
In my simplified table I am just going to have two columns a primary key and a value…both of these could be made up of multiple columns in real life.
But in reality this representation of a table is a little bit oversimplified because tables are always being updated (that is the whole point of a database, after all). But this table is just static. How can I represent a table that is getting updated like our sensors or log files are?
Well the easy way to do it would be just dump out a full copy of the table periodically. In this picture I’ve represented a sequence of snapshots of the table as time goes by.
Now it’s a bit inefficient to take a full dump of the table over and over, right? Probably, if your tables are like mine, not all your rows are getting updated all the time. An alternative that might be a bit more efficient would be to just dump out the rows that changed. This would give me a sequence of “diffs”. Now imagine I increase the frequency of this process to make the diff as small as possible. Clearly the smallest possible diff would be a single changed row.
Here I’ve listed the sequence of single changed rows, each represented by a single PUT operation (an update or insert).
Now the key thing is that if I have this sequence of changes it actually represents all the states of my table.
And, of course, that sequence of updates is a stream of events. The event is something like “the value of this primary key is now X”.
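To make the table/stream duality concrete, here is a minimal sketch in plain Python. It assumes a changelog is just an ordered list of (primary key, value) PUT events; the names `replay` and `changelog` are made up for illustration and are not part of any Kafka API. Replaying the changes one by one rebuilds every state the table went through.

```python
from typing import Dict, List, Tuple

# Assumption for this sketch: a changelog is an ordered list of PUT events,
# each one a (primary_key, new_value) pair.
Changelog = List[Tuple[str, str]]


def replay(changelog: Changelog) -> List[Dict[str, str]]:
    """Return the table state after each PUT, oldest first."""
    table: Dict[str, str] = {}
    snapshots: List[Dict[str, str]] = []
    for key, value in changelog:
        table[key] = value             # a PUT is an insert or an update
        snapshots.append(dict(table))  # copy, so later PUTs don't alter it
    return snapshots


changelog = [("k1", "a"), ("k2", "b"), ("k1", "c")]
states = replay(changelog)
print(states[-1])  # {'k1': 'c', 'k2': 'b'}
```

The last snapshot is the current table; the earlier ones are the states the periodic full dumps would have captured.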
Now I can represent all these different data pipelines as event streams.
I can capture changes from a data system or application, and take that stream and feed it into another system.
That is going to be the key to solving my pipeline sprawl problem.
Instead of having N^2 different pipelines, one for each pair of systems I am going to have a central place that hosts all these event streams—the streaming platform.
This is a central way that all these systems and applications can plug in to get the streams they need.
So I can capture streams from databases, and feed them into DWH, Hadoop, monitoring and analytics systems.
The key advantage is that there is a single integration point for each thing that wants data.
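A quick back-of-the-envelope sketch of why this matters. The counting model is an assumption made for illustration: in the worst case every system feeds every other system directly, versus each system having one integration point with the central platform.

```python
def point_to_point_pipelines(n: int) -> int:
    """Assumed worst case: every system feeds every other system directly."""
    return n * (n - 1)


def hub_connections(n: int) -> int:
    """Each system has a single integration point with the central platform."""
    return n


for n in (5, 10, 50):
    print(n, point_to_point_pipelines(n), hub_connections(n))
```

With 10 systems that is up to 90 separate pipelines against 10 integration points, and the gap widens quickly as systems are added.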
Now obviously to make this work I’m going to need to ensure I have met the reliability, scalability, and latency guarantees for each of these systems.
Let’s dive into an example to see this model of data in action.
Let’s say that we have a web app that is recording events about a product being viewed. And let’s say we are using Hadoop for analytics and want to get this data there.
In this model the web app publishes its stream of clicks to our streaming platform and Hadoop loads these. With only two systems, the only real advantage is some decoupling—the web app isn’t tied to the particular technology we are using for analytics, and the Hadoop cluster doesn’t need to be up all the time.
But the advantage is that additional uses of this data become really easy.
For example if other apps can also generate product view events, they just publish these, Hadoop doesn’t need to know there are more publishers of this type of event.
And if additional use cases arise they can be added as well. In this example there turn out to be a number of other uses for product views: analytics, recommendations, security monitoring, etc. These can all just subscribe without any need to go back and modify any of the apps that generate product views.
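The product-view example can be sketched with a toy in-memory platform. Everything here (the `topics` dictionary, `publish`, `read_all`, the topic name) is hypothetical and only illustrates the decoupling; it is not how Kafka is implemented or called.

```python
from collections import defaultdict

# Hypothetical in-memory platform: topic name -> stored list of events.
topics = defaultdict(list)


def publish(topic, event):
    """A publisher only knows the topic name, not who reads it."""
    topics[topic].append(event)


def read_all(topic):
    """A subscriber reads whatever has been stored, whenever it is ready."""
    return list(topics[topic])


# Two independent apps publish the same kind of event.
publish("product-views", {"source": "web", "product": "p1"})
publish("product-views", {"source": "mobile", "product": "p2"})

# Subscribers added later need no change to the publishers.
hadoop_load = read_all("product-views")
recommendations_input = read_all("product-views")
print(len(hadoop_load), len(recommendations_input))  # 2 2
```

Because the events are stored rather than pushed, a subscriber that starts late still sees everything that was published.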
Okay, so we talked about how streams can be used for solving the data pipeline sprawl problem. Now let’s talk about the solution to the second problem: too much synchrony.
This comes from being able to process real-time streams of data and this is called stream processing.
So what is stream processing?
Best way to think about it is as a third paradigm for programming. We talked about request/response and batch processing. Let’s dive into these a bit and use them to motivate stream processing.
HTTP/REST
All databases
Run all the time
Each request totally independent—No real ordering
Can fail individual requests if you want
Very simple!
About the future!
“Ed, the MapReduce job never finishes if you watch it like that”
Job kicks off at a certain time
Cron!
Processes all the input, produces all the output
Data is usually static
Hadoop!
DWH, JCL
Archaic but powerful. Can do analytics! Complex algorithms!
Also can be really efficient!
Inherently high latency
Generalizes request/response and batch.
Program takes some inputs and produces some outputs
Could be all inputs
Could be one at a time
Runs continuously forever!
Basically a service that processes, reacts to, or transforms streams of events.
Asynchronous so it allows us to decouple work from our request/response services.
Many things are naturally thought of as stream processing.
Walmart blog
Now we’ve talked about these two motivations for streams: solving pipeline sprawl and asynchronous stream processing.
It won’t surprise anyone that when I talk about this streaming platform that enables these pipelines and processing I am talking about Apache Kafka.
So what is Kafka?
It’s a streaming platform.
Lets you publish and subscribe to streams of data, stores them reliably, and lets you process them in real time.
The second half of this talk will dive into Apache Kafka and talk about how it acts as a streaming platform and lets you build real-time streaming pipelines and do stream processing.
It’s widely used and in production at thousands of companies.
Let’s walk through the basics of Kafka and understand how it acts as a streaming platform.
Events = Record = Message
Timestamp, an optional key and a value
Key is used for partitioning. Timestamp is used for retention and processing.
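As a rough sketch of that record shape, the class below carries a timestamp, an optional key, and a value, and a helper picks a partition from the key. The field names, the `crc32` hash, and the fallback for keyless records are all assumptions made for this example; Kafka's own partitioner uses a different hash and a different policy for keyless records.

```python
import time
import zlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Record:
    """Toy record: a value, an optional key, and a timestamp."""
    value: bytes
    key: Optional[bytes] = None
    timestamp_ms: int = field(default_factory=lambda: int(time.time() * 1000))


def partition_for(record: Record, num_partitions: int, fallback: int = 0) -> int:
    """Pick a partition from the key so records with equal keys land together.

    crc32 is a stand-in hash chosen for this sketch, and `fallback` is a
    placeholder policy for keyless records.
    """
    if record.key is None:
        return fallback
    return zlib.crc32(record.key) % num_partitions


first = Record(b"viewed p1", key=b"user-1")
second = Record(b"viewed p2", key=b"user-1")
print(partition_for(first, 6) == partition_for(second, 6))  # True
```

Hashing the key means all events for one key go to the same partition, which is what lets them stay in order relative to each other.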
Not an apache log
Different: Commit log
Stolen from distributed database internals
Key abstraction for systems, real-time processing, data integration
Formalization of a stream
Reader controls progress—unifies batch and real-time
Relate to pub/sub
World is a process/threads (total order) but no order between
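The "reader controls progress" point can be illustrated with a toy append-only log. The `CommitLog` class below is a hypothetical in-memory model written for this note, not Kafka's storage format or client API: records get increasing offsets, and the reader decides where to read from.

```python
from typing import List, Tuple


class CommitLog:
    """Append-only sequence; each record gets the next integer offset."""

    def __init__(self) -> None:
        self._records: List[str] = []

    def append(self, record: str) -> int:
        """Store the record at the end and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset: int, max_records: int = 100) -> List[Tuple[int, str]]:
        """Return (offset, record) pairs starting at `offset`."""
        chunk = self._records[offset:offset + max_records]
        return list(enumerate(chunk, start=offset))


log = CommitLog()
for event in ["a", "b", "c"]:
    log.append(event)

# The reader, not the log, tracks its position.
position = 0
batch = log.read(position, max_records=2)
position = batch[-1][0] + 1   # advance past what was processed
rest = log.read(position)     # picks up where it left off
replayed = log.read(0)        # or rewinds and re-reads everything
```

A reader that keeps up with the end of the log behaves like a real-time subscriber, while one that starts at offset 0 and reads everything behaves like a batch job; that is the sense in which the log unifies the two.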
Four APIs to read and write streams of events
The first two are easy: the producer and consumer APIs allow applications to write to and read from Kafka.
The Connect API allows building connectors that integrate Kafka with existing systems or applications.
The Streams API allows stream processing on top of Kafka.
We’ll go through each of these briefly.
The producer writes (publishes) streams of events to Kafka to be stored.
The consumer reads (subscribes to) streams of events from topics.
Kafka topics are always multi-reader and can be scaled out. So in this example I have two logical consumers: A and B. Each of these logical consumers is made up of multiple physical processes, potentially running on different machines. Two processes for A and three for B.
These groups are dynamic: processes can join a group or leave a group at any time and Kafka will balance the load over the new set of processes.
So for example if one of the B processes dies, the data being consumed by that process will be transitioned to the remaining B processes automatically.
These groups are a fundamental abstraction in Kafka and they support not only groups of consumers, but also groups of connectors or stream processors.
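Here is a small sketch of how load might be rebalanced when a process leaves a group. The round-robin policy, the partition numbers, and the member names are assumptions for illustration; Kafka's actual assignment strategies are configurable and more involved.

```python
from typing import Dict, List


def assign(partitions: List[int], members: List[str]) -> Dict[str, List[int]]:
    """Spread partitions over group members round-robin (an assumed policy)."""
    if not members:
        raise ValueError("a group needs at least one member")
    ordered = sorted(members)
    assignment: Dict[str, List[int]] = {m: [] for m in ordered}
    for i, partition in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment


partitions = list(range(6))
group_b = ["b1", "b2", "b3"]
before = assign(partitions, group_b)

group_b.remove("b2")                  # one of the B processes dies
after = assign(partitions, group_b)   # survivors take over its partitions

print(before)  # {'b1': [0, 3], 'b2': [1, 4], 'b3': [2, 5]}
print(after)   # {'b1': [0, 2, 4], 'b3': [1, 3, 5]}
```

Every partition is still owned by exactly one member after the change, so no data is left unconsumed.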
In our streaming platform vision we had a number of apps or data systems that were integrated with Kafka. Either they are loading streams of data out of Kafka or publishing streams of data into Kafka.
If these systems are built to directly integrate with Kafka they could use the producer and consumer APIs. But many apps and databases simply have read and write APIs; they don’t know anything about Kafka. How can we make integration with this kind of existing app or system easy? After all, these systems don’t know that they need to push data into Kafka or pull data out.
The answer is the Connect API.
These APIs allow writing reusable connectors to Kafka.
A source is a connector that reads data out of the external system and publishes to Kafka.
A sink is a connector that pulls data out of Kafka and writes it to the external system.
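The two connector roles can be sketched with plain functions. This is a toy model, not the Connect API: the topic is a Python list, and `run_source`, `run_sink`, `database_rows`, and `search_index` are hypothetical names standing in for real systems.

```python
from typing import Callable, Dict, Iterable, List


def run_source(read_external: Callable[[], Iterable[dict]], topic: List[dict]) -> int:
    """Source: read rows out of an external system and publish them."""
    count = 0
    for row in read_external():
        topic.append(row)
        count += 1
    return count


def run_sink(topic: List[dict], write_external: Callable[[dict], None], start: int = 0) -> int:
    """Sink: pull records from the topic and write them to an external system."""
    for record in topic[start:]:
        write_external(record)
    return len(topic)  # next position to resume from


database_rows = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
topic: List[dict] = []
search_index: Dict[int, dict] = {}


def index_record(record: dict) -> None:
    """Stand-in for writing to an external search system."""
    search_index[record["id"]] = record


published = run_source(lambda: database_rows, topic)
next_position = run_sink(topic, index_record)
```

Neither the database nor the search index needs to know about the topic; only the connector code does.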
Of course you could build this integration using the producer and consumer APIs, so how is this better?
REST APIs for management
A few examples help illustrate this
We’ll dive into Kafka Connect in more detail in the third installment of this talk series, which goes far deeper into the practice of building streaming pipelines with Kafka.
The final API for Kafka is the Streams API.
This API lets you build real-time stream processing on top of Kafka.
These stream processors take input from Kafka topics and either react to the input or transform it into output written to output topics.
So in effect a stream processing app is basically just some code that consumes input and produces output.
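In that spirit, a stream processor can be sketched as a function that consumes events one at a time and emits results as it goes. This is a plain-Python illustration, not the Kafka Streams API; the sales events and the `running_totals` name are made up for the example.

```python
from typing import Dict, Iterable, Iterator, Tuple


def running_totals(sales: Iterable[Tuple[str, int]]) -> Iterator[Tuple[str, int]]:
    """For each (product, quantity) event, emit (product, total_so_far)."""
    totals: Dict[str, int] = {}
    for product, quantity in sales:
        totals[product] = totals.get(product, 0) + quantity
        yield product, totals[product]


input_topic = [("apple", 2), ("pear", 1), ("apple", 3)]
output_topic = list(running_totals(input_topic))
print(output_topic)  # [('apple', 2), ('pear', 1), ('apple', 5)]
```

Because the function is a generator, it would work the same way over an input that never ends. The running totals it keeps in memory are a small example of the state that makes real stream processing harder than it first looks.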
So why not just use the producer and consumer APIs?
Well, it turns out there are some hard parts to doing real-time stream processing.
Add screenshot example
Add screenshot example
Companies == streams
What a retail store does
Streams
Retail
- Sales
- Shipments and logistics
- Pricing
- Re-ordering
- Analytics
- Fraud and theft
Table/Stream duality
One thing you might be thinking is that this streaming vision isn’t really different from existing technology like Enterprise Messaging Systems or Enterprise Service Buses.
So I thought it might be worth giving a quick cliff notes on how Kafka and modern stream processing technologies compare to previous generations of systems. For those really interested in this question we’re putting together a white paper that gives a much more detailed answer. But for those who just want the cliff notes I think there are three key differences.
The richness of the stream processing capabilities is a major advance over the previous generations of technology.
The other two differences really come from Kafka being a modern distributed system:
- it scales horizontally on commodity machines
- and it gives strong guarantees for data
Let’s dive into these two a little bit.
So we’ve talked about the APIs and abstractions. In the next few slides I’ll give a preview of Kafka as a data system: the guarantees and capabilities it has. Jun, my co-founder, will be doing a much deeper dive in this area in the next talk in this series, so if you want to learn more about how Kafka works that is the thing to see. But I’ll give a quick walk through of what Kafka provides. Each of these characteristics is really essential to its usage as a “universal data pipeline” and processing technology.
First it scales well and cheaply.
You can do hundreds of MB/sec of writes per server and can have many servers
Kafka doesn’t get slower as you store more data in it
In this respect it performs a lot like a distributed file system
This is very different from existing messaging systems
Without this, a lot of the “big data” workloads that Kafka gets used for, which often have very high volume data streams, would not be possible or feasible.
This scalability is also really important for centralizing a lot of data streams in the same place—if that didn’t scale well it just wouldn’t be practical.
Next Kafka provides strong guarantees for data written to the cluster. Writes are replicated across multiple machines for fault tolerance, and we acknowledge the write back to the client.
All data is persisted to the filesystem.
And writes to the Kafka cluster are strongly ordered.
This is another difference from a traditional messaging system—they usually do a poor job of supporting strong ordering of updates with more than a single consumer.
Works as a cluster
Can replace machines without bringing down the cluster
Failures are handled transparently
Data is not lost if a machine is destroyed
Can scale elastically as usage grows.