• Everything in the company is a real-time stream
• > 1.2 trillion messages written per day
• > 3.4 trillion messages read per day
• ~ 1 PB of stream data
• Thousands of engineers
• Tens of thousands of producer processes
• Used as commit log for distributed database
Coming Up Next

Date   Title                                                           Speaker
10/6   Deep Dive Into Apache Kafka                                     Jun Rao
10/27  Data Integration with Kafka                                     Gwen Shapira
11/17  Demystifying Stream Processing                                  Neha Narkhede
12/1   A Practical Guide To Selecting A Stream Processing Technology   Michael Noll
12/15  Streaming in Practice: Putting Apache Kafka in Production       Roger Hoover

Introduction To Streaming Data and Stream Processing with Apache Kafka


Editor's Notes

  1. Hi, I’m Jay Kreps, I’m one of the creators of Apache Kafka and also one of the co-founders of Confluent, the company driving Kafka development as well as developing Confluent Platform, the leading Kafka distribution. Welcome to our Apache Kafka Online Talk Series. This first talk is going to introduce Kafka and the problems it was built to solve. This is a series of talks meant to help introduce you to the world of Apache Kafka and stream processing. Along the way I’ll give pointers to areas we are going to dive into in more depth in upcoming talks.
  2. Rather than starting off by diving into a bunch of Kafka features, let me instead introduce the problem area. So what is the problem we have today that needs a new thing? To show that, let me start by just laying out the architecture for most companies.
  3. Most applications are request/response (client/server): HTTP services, OLTP databases, key/value stores. You send a request, they send back a response. These do little bits of work quickly. UI rendering is inherently this way: the client sends a request to fetch the data to display the UI. Inherently synchronous—you can’t display the UI until you get back the response with the data.
  4. The second big area is batch processing. This is the domain of the data warehouse and Hadoop clusters. Cron jobs. These are usually once-a-day things, though you can potentially run them a little quicker. So this is the architecture we have today. What are the problems?
  5. How does data get around?
  6. Database data, log data. Lots of systems—databases, specialized systems like search, caches. Business units. N^2 connections. Tons of glue code to stitch it all together.
  7. Request/response is inherently synchronous. Hard to scale.
  8. Either big apps with huge amounts of work per request, or lots of little microservices…still, all that work is synchronous. It has to be synchronous—if you make an HTTP request but don’t wait for the response, you don’t know if it actually happened or not.
  9. Example: retail Sales are synchronous—you give me money and I give you a product (or commit to ship you a product) and give you a receipt or confirmation number. But a lot of the backend isn’t synchronous—I need to process shipments of new products, adjust prices, do inventory adjustments, re-order products, do things like analytics. Most of these don’t make sense to do in the process of a single sale—they are asynchronous. If something gets borked in my inventory reordering process I don’t want to block sales.
  10. These are the two problems that data streams can solve: Data pipeline sprawl Asynchronous services
  11. This is what that architecture looks like relying on streaming. Data pipelines go to the streaming platform, no longer N^2 separate pipelines. Async apps can feed off of this as well. Obviously that streaming box is going to be filled by Kafka. Now let’s dive into these two areas.
  12. Companies are real-time not batch
  13. Event = something that happened. A record. A product was viewed, a sale occurred, a database was updated, etc. It’s a piece of data, a fact. But it can also be a trigger or command (a sale occurred, so now let’s reorder). Not specific to a particular system or service, just a fact. Let’s look at a few concrete examples to get a feel for it, first some simple ones, then something a bit more complex.
  14. The event is “a web page was viewed” or “an error occurred” or whatever you’re logging. In fact the “log file” is totally incidental to the data being recorded—the data in the log is clearly a sequence of events.
  15. Sensors can also be represented as event streams. The event is something like “the value of this sensor is X.” This covers a lot of instrumentation of the world: IoT use cases, logistics and vehicle positions, or even taking readings of metrics from monitoring counters or gauges in your apps. All these sensors can be captured into a stream of events. Okay, those were the easy and obvious ones; now let’s look at something more surprising.
  16. Databases can be thought of as streams of events! This isn’t obvious, but it’s really important because most valuable data is stored in databases. What do I mean that you can think of a database as a stream of events? Well what’s the most common data representation in a database? Table/Stream duality.
  17. It’s a table. A table looks something like this, a rectangle with columns, right? In my simplified table I am just going to have two columns, a primary key and a value—both of these could be made up of multiple columns in real life. But in reality this representation of a table is a little oversimplified, because tables are always being updated (that is the whole point of a database, after all). But this table is just static. How can I represent a table that is getting updated, like our sensors or log files are?
  18. Well the easy way to do it would be just dump out a full copy of the table periodically. In this picture I’ve represented a sequence of snapshots of the table as time goes by.
  19. Now it’s a bit inefficient to take a full dump of the table over and over, right? Probably, if your tables are like mine, not all your rows are getting updated all the time. An alternative that might be a bit more efficient would be to just dump out the rows that changed. This would give me a sequence of “diffs”. Now imagine I increase the frequency of this process to make the diff as small as possible. Clearly the smallest possible diff would be a single changed row. Here I’ve listed the sequence of single changed rows, each represented by a single PUT operation (an update or insert). Now the key thing is that if I have this sequence of changes it actually represents all the states of my table. And, of course, that sequence of updates is a stream of events. The event is something like “the value of this primary key is now X”.
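
This replay idea can be demonstrated outside Kafka entirely. Below is a minimal Java sketch (the table contents are invented) showing that applying a changelog of single-row PUTs in order reconstructs the latest state of the table:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ChangelogReplay {
    // One changed row: "the value of this primary key is now X".
    record Put(String key, String value) {}

    public static void main(String[] args) {
        // The sequence of diffs, made as small as possible: one changed row per event.
        List<Put> changelog = List.of(
            new Put("item-1", "price=10"),
            new Put("item-2", "price=7"),
            new Put("item-1", "price=12")); // a later update to the same key wins

        // Replaying the stream in order reconstructs the latest state of the table.
        Map<String, String> table = new LinkedHashMap<>();
        for (Put p : changelog) {
            table.put(p.key(), p.value());
        }
        System.out.println(table); // {item-1=price=12, item-2=price=7}
    }
}
```

The same replay works from any point in the stream, which is exactly what makes the changelog a full substitute for periodic snapshots.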
  20. Now I can represent all these different data pipelines as event streams. I can capture changes from a data system or application, and take that stream and feed it into another system.
  21. That is going to be the key to solving my pipeline sprawl problem. Instead of having N^2 different pipelines, one for each pair of systems, I am going to have a central place that hosts all these event streams—the streaming platform. This is a central way that all these systems and applications can plug in to get the streams they need. So I can capture streams from databases and feed them into the DWH, Hadoop, monitoring, and analytics systems. The key advantage is that there is a single integration point for each thing that wants data. Now, obviously, to make this work I’m going to need to ensure I have met the reliability, scalability, and latency guarantees for each of these systems.
  22. Let’s dive into an example to see the example of this model of data. Let’s say that we have a web app that is recording events about a product being viewed. And let’s say we are using Hadoop for analytics and want to get this data there. In this model the web app publishes its stream of clicks to our streaming platform and Hadoop loads these. With only two systems, the only real advantage is some decoupling—the web app isn’t tied to the particular technology we are using for analytics, and the Hadoop cluster doesn’t need to be up all the time. But the advantage is that additional uses of this data become really easy.
  23. For example if other apps can also generate product view events, they just publish these, Hadoop doesn’t need to know there are more publishers of this type of event.
  24. And if additional use cases arise they can be added as well. In this example there turn out to be a number of other uses for product views—analytics, recommendations, security monitoring, etc. These can all just subscribe without any need to go back and modify any of the apps that generate product views.
  25. Okay so we talked about how streams can be used for solving the data pipeline sprawl problem. Now let’s talk about the solution to the second problem---too much synchrony. This comes from being able to process real-time streams of data and this is called stream processing. So what is stream processing?
  26. Best way to think about it is as a third paradigm for programming. We talked about request/response and batch processing. Let’s dive into these a bit and use them to motivate stream processing.
  27. HTTP/REST. All databases. Run all the time. Each request totally independent—no real ordering. Can fail individual requests if you want. Very simple! About the future!
  28. “Ed, the MapReduce job never finishes if you watch it like that.” The job kicks off at a certain time. Cron! Processes all the input, produces all the output. Data is usually static. Hadoop! DWH, JCL. Archaic but powerful. Can do analytics! Complex algorithms! Also can be really efficient! Inherently high latency.
  29. Generalizes request/response and batch. A program takes some inputs and produces some outputs. Could be all the inputs; could be one at a time. Runs continuously forever!
  30. Basically a service that processes, reacts to, or transforms streams of events. Asynchronous so it allows us to decouple work from our request/response services.
  31. Many things are naturally thought of as stream processing. Walmart blog.
  32. Now we’ve talked about these two motivations for streams—solving pipeline sprawl and asynchronous stream processing. It won’t surprise anyone that when I talk about this streaming platform that enables these pipelines and processing, I am talking about Apache Kafka.
  33. So what is Kafka? It’s a streaming platform. It lets you publish and subscribe to streams of data, stores them reliably, and lets you process them in real time. The second half of this talk will dive into Apache Kafka and talk about how it acts as a streaming platform and lets you build real-time streaming pipelines and do stream processing.
  34. It’s widely used and in production at thousands of companies. Let’s walk through the basics of Kafka and understand how it acts as a streaming platform.
  35. Event = Record = Message. A record has a timestamp, an optional key, and a value. The key is used for partitioning. The timestamp is used for retention and processing.
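
To make that anatomy concrete, here is a minimal sketch using the Java client’s ProducerRecord; the topic, key, and value are invented for illustration:

```java
import org.apache.kafka.clients.producer.ProducerRecord;

public class RecordAnatomy {
    public static void main(String[] args) {
        // Constructor arguments: topic, partition, timestamp, key, value.
        ProducerRecord<String, String> record = new ProducerRecord<>(
            "product-views",             // illustrative topic name
            null,                        // partition: derived from the key when null
            System.currentTimeMillis(),  // timestamp: used for retention and processing
            "user-42",                   // optional key: used for partitioning
            "viewed item-1");            // value: the event itself
        System.out.println(record);
    }
}
```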
  36. Not an Apache-style log file. Different: a commit log. Stolen from distributed database internals. The key abstraction for systems, real-time processing, and data integration. A formalization of a stream. The reader controls progress—this unifies batch and real-time.
  37. Relate to pub/sub
  38. The world is made of processes/threads: within each one there is a total order, but there is no order between them.
  39. Four APIs to read and write streams of events. The first two are easy: the producer and consumer APIs allow applications to write to and read from Kafka. The Connect API allows building connectors that integrate Kafka with existing systems or applications. The Streams API allows stream processing on top of Kafka. We’ll go through each of these briefly.
  40. The producer writes (publishes) streams of events to Kafka to be stored.
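
A minimal sketch of a publisher using the Java producer API; the broker address (localhost:9092), topic, and record contents are assumptions for illustration:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PublishExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all"); // wait for the replicas to acknowledge the write

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(
                new ProducerRecord<>("product-views", "user-42", "viewed item-1"),
                (metadata, exception) -> {
                    if (exception != null) exception.printStackTrace();
                });
        } // closing the producer flushes any outstanding sends
    }
}
```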
  41. The consumer reads (subscribes to) streams of events from topics.
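
The consuming side, again as a hedged sketch with an assumed local broker and invented topic and group names:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SubscribeExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("group.id", "analytics"); // illustrative group name
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("product-views"));
            while (true) { // runs continuously, processing events as they arrive
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("%s -> %s%n", r.key(), r.value());
                }
            }
        }
    }
}
```

Running a second copy of this process with the same group.id is all it takes to form the consumer groups described in the next note.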
  42. Kafka topics are always multi-reader and can be scaled out. So in this example I have two logical consumers: A and B. Each of these logical consumers is made up of multiple physical processes, potentially running on different machines. Two processes for A and three for B. These groups are dynamic: processes can join a group or leave a group at any time and Kafka will balance the load over the new set of processes.
  43. So for example if one of the B processes dies, the data being consumed by that process will be transitioned to the remaining B processes automatically. These groups are a fundamental abstraction in Kafka and they support not only groups of consumers, but also groups of connectors or stream processors.
  44. In our streaming platform vision we had a number of apps or data systems that were integrated with Kafka: either they are loading streams of data out of Kafka or publishing streams of data into Kafka. If these systems are built to directly integrate with Kafka they could use the producer and consumer APIs. But many apps and databases simply have read and write APIs; they don’t know anything about Kafka. How can we make integration with this kind of existing app or system easy? After all, these systems don’t know that they need to push data into Kafka or pull data out. The answer is the Connect API.
  45. These APIs allow writing reusable connectors to Kafka. A source is a connector that reads data out of the external system and publishes it to Kafka. A sink is a connector that pulls data out of Kafka and writes it to the external system. Of course you could build this integration using the producer and consumer APIs, so how is this better?
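
Part of the answer is that a connector is configured rather than coded. As a minimal sketch, here is a standalone config for the FileStreamSource connector that ships with Kafka; the file path and names are illustrative:

```properties
# Tail a file and publish each new line as an event to a topic.
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/tmp/access.log
topic=file-lines
```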
  46. REST APIs for management. A few examples help illustrate this.
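
For instance, a Connect worker can be managed over HTTP. Assuming a worker on its default port 8083 and the connector name from the sketch above:

```sh
# List the connectors the worker is running.
curl http://localhost:8083/connectors

# Check the health of one connector and its tasks.
curl http://localhost:8083/connectors/local-file-source/status
```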
  47. We’ll dive into Kafka connect in more detail in the third installment of this talk series which goes far deeper into the practice of building streaming pipelines with Kafka.
  48. The final API for Kafka is the Streams API. This API lets you build real-time stream processing on top of Kafka. These stream processors take input from Kafka topics and either react to the input or transform it into output written to output topics.
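
A minimal Kafka Streams sketch of such a processor; the application id, topic names, and filter predicate are invented for illustration:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class ViewFilter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "view-filter");       // illustrative id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed local broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> views = builder.stream("product-views");  // input topic
        views.filter((user, event) -> event.startsWith("viewed"))         // transform/react
             .to("filtered-views");                                       // output topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start(); // runs continuously until closed
    }
}
```

Additional copies of this program with the same application id split the work among themselves, the same way consumer groups do.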
  49. So in effect a stream processing app is basically just some code that consumes input and produces output. So why not just use the producer and consumer APIs? Well, it turns out there are some hard parts to doing real-time stream processing.
  50. Add screenshot example
  51. Add screenshot example
  52. Companies == streams. What does a retail store do? Streams: sales, shipments and logistics, pricing, re-ordering, analytics, fraud and theft.
  53. Table/Stream duality
  54. One thing you might be thinking is that this streaming vision isn’t really different from existing technology like Enterprise Messaging Systems or Enterprise Service Buses.
  55. So I thought it might be worth giving a quick cliff notes on how Kafka and modern stream processing technologies compare to previous generations of systems. For those really interested in this question we’re putting together a white paper that gives a much more detailed answer. But for those who just want the cliff notes I think there are three key differences.
  56. The richness of the stream processing capabilities is a major advance over previous generations of technology. The other two differences really come from Kafka being a modern distributed system—it scales horizontally on commodity machines, and it gives strong guarantees for data. Let’s dive into these two a little bit.
  57. So we’ve talked about the APIs and abstractions; in the next few slides I’ll give a preview of Kafka as a data system—the guarantees and capabilities it has. Jun, my co-founder, will be doing a much deeper dive in this area in the next talk in this series, so if you want to learn more about how Kafka works, that is the thing to see. But I’ll give a quick walkthrough of what Kafka provides. Each of these characteristics is really essential to its usage as a “universal data pipeline” and processing technology.
  58. First, it scales well and cheaply. You can do hundreds of MB/sec of writes per server and can have many servers. Kafka doesn’t get slower as you store more data in it. In this respect it performs a lot like a distributed file system. This is very different from existing messaging systems. Without this, a lot of the “big data” workloads that Kafka gets used for, which often have very high volume data streams, would not be possible or feasible. This scalability is also really important for centralizing a lot of data streams in the same place—if that didn’t scale well it just wouldn’t be practical.
  59. Next, Kafka provides strong guarantees for data written to the cluster. Writes are replicated across multiple machines for fault tolerance, and the write is acknowledged back to the client. All data is persisted to the filesystem. And writes to the Kafka cluster are strongly ordered. This is another difference from a traditional messaging system—they usually do a poor job of supporting strong ordering of updates with more than a single consumer.
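
The knobs behind those guarantees look roughly like this. This is an illustrative sketch only: in practice the replication factor is set per topic (or as a broker default), min.insync.replicas on the broker or topic, and acks on the producer, not in one file.

```properties
# Each write is copied to three brokers.
replication.factor=3
# A write succeeds only if at least two copies are current.
min.insync.replicas=2
# The producer waits for all in-sync replicas to acknowledge.
acks=all
```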
  60. Works as a cluster. Machines can be replaced without bringing down the cluster. Failures are handled transparently. Data is not lost if a machine is destroyed. It can scale elastically as usage grows.