For a long time, much of the data processing that companies did ran as big batch jobs. But businesses operate in real time, and the software they run is catching up: processing data in a streaming fashion is becoming more and more popular, displacing the more "traditional" approach of batch-processing big data sets that are available as a whole.
Consistency and Completeness: Rethinking Distributed Stream Processing in Apa... - Guozhang Wang
We present Apache Kafka’s core design for stream processing, which relies on its persistent log architecture as the storage and inter-processor communication layers to achieve correctness guarantees. Kafka Streams, a scalable stream processing client library in Apache Kafka, defines the processing logic as read-process-write cycles in which all processing state updates and result outputs are captured as log appends. Idempotent and transactional write protocols are utilized to guarantee exactly-once semantics. Furthermore, revision-based speculative processing is employed to emit results as soon as possible while handling out-of-order data. We also demonstrate how Kafka Streams behaves in practice in large-scale deployments, with performance insights exhibiting its flexible and low-overhead trade-offs.
Performance Analysis and Optimizations for Kafka Streams Applications - Guozhang Wang
High-speed, low-footprint data stream processing is in high demand for Kafka Streams applications. However, how to write an efficient streaming application using the Streams DSL is a question many users have asked in the past, since it requires deep knowledge of Kafka Streams internals. In this talk, I will cover how to analyze your Kafka Streams applications, locate performance bottlenecks and unnecessary storage costs, and optimize your application code accordingly using the Streams DSL.
In addition, I will talk about the new optimization framework that we have been developing inside Kafka Streams since the 2.1 release, which replaced the in-place translation of the Streams DSL with a comprehensive process composed of topology compilation and rewriting phases, with a focus on reducing the various storage footprints of Streams applications, such as state stores and internal topics.
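For readers who want to try the optimizer described above, here is a minimal sketch of how it is switched on in a recent Kafka Streams release; the application id, broker address, and topic name are placeholders, and before Kafka 2.7 the config constant was named StreamsConfig.TOPOLOGY_OPTIMIZATION:

import java.util.Properties;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;

public class OptimizerSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "optimized-app");      // placeholder
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
    // Opt in to the "all" rewrites; the default is "none".
    props.put(StreamsConfig.TOPOLOGY_OPTIMIZATION_CONFIG, StreamsConfig.OPTIMIZE);

    StreamsBuilder builder = new StreamsBuilder();
    builder.table("input-topic"); // placeholder; the source topic can be reused as the changelog
    // The properties must also be passed to build() so the rewriting phase actually runs.
    Topology topology = builder.build(props);
    System.out.println(topology.describe());
  }
}

Printing topology.describe() before and after enabling optimization is an easy way to see which internal topics and state stores the rewriting phase eliminated.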
Kafka Streams: the easiest way to start with stream processing - Yaroslav Tkachenko
Stream processing is getting more and more important in our data-centric systems. In the world of Big Data, batch processing is not enough anymore: everyone needs interactive, real-time analytics for making critical business decisions, as well as for providing great features to customers.
There are many stream processing frameworks available nowadays, but the cost of provisioning infrastructure and maintaining distributed computations is usually very high, and sometimes you also have to satisfy specific requirements, like using HDFS or YARN.
Apache Kafka is the de facto standard for building data pipelines. Kafka Streams is a lightweight library (available since 0.10) that uses powerful Kafka abstractions internally and doesn't require any complex setup or special infrastructure: you just deploy it like any other regular application.
In this session I want to talk about the goals behind stream processing, basic techniques, and some best practices. Then I'm going to explain the main fundamental concepts behind Kafka and explore Kafka Streams syntax and streaming features. By the end of the session you'll be able to write stream processing applications in your own domain, especially if you already use Kafka as your data pipeline.
Exactly-once Stream Processing with Kafka Streams - Guozhang Wang
I will present the recent additions to Kafka (0.11.0) that achieve exactly-once semantics within its Streams API for stream processing use cases. This is achieved by leveraging the underlying idempotent and transactional client features. The main focus will be the specific semantics that Kafka distributed transactions enable in Streams and the underlying mechanics that let Streams scale efficiently.
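As a rough illustration of how little application code this requires, here is a minimal sketch (written against the current StreamsBuilder API rather than the 0.11-era one; the app id, brokers, and topics are placeholders) that enables exactly-once with a single config entry:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class EosSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "eos-app");            // placeholder
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    // One switch: Streams then uses the idempotent, transactional producer and
    // commits consumed offsets as part of each transaction.
    props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);

    StreamsBuilder builder = new StreamsBuilder();
    builder.stream("input").to("output"); // placeholder pass-through topology
    new KafkaStreams(builder.build(), props).start();
  }
}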
Kafka Streams is a new stream processing library natively integrated with Kafka. It has a very low barrier to entry, easy operationalization, and a natural DSL for writing stream processing applications. As such it is the most convenient yet scalable option to analyze, transform, or otherwise process data that is backed by Kafka. We will provide the audience with an overview of Kafka Streams including its design and API, typical use cases, code examples, and an outlook of its upcoming roadmap. We will also compare Kafka Streams' light-weight library approach with heavier, framework-based tools such as Spark Streaming or Storm, which require you to understand and operate a whole different infrastructure for processing real-time data in Kafka.
Building a Replicated Logging System with Apache Kafka - Guozhang Wang
Apache Kafka is a scalable publish-subscribe messaging system with its core architecture as a distributed commit log. It was originally built at LinkedIn as its centralized event pipelining platform for online data integration tasks. Over the past years of developing and operating Kafka, we have extended its log-structured architecture as a replicated logging backbone for much wider application scopes in the distributed environment. I am going to talk about our design and engineering experience replicating Kafka logs for various distributed data-driven systems, including source-of-truth data storage and stream processing.
What's the time? ...and why? (Matthias Sax, Confluent) Kafka Summit SF 2019 - confluent
Data stream processing is built on the core concept of time. However, understanding time semantics and reasoning about time is not simple, especially if deterministic processing is expected. In this talk, we explain the difference between processing, ingestion, and event time and what their impact is on data stream processing. Furthermore, we explain how Kafka clusters and stream processing applications must be configured to achieve specific time semantics. Finally, we deep dive into the time semantics of the Kafka Streams DSL and KSQL operators, and explain in detail how the runtime handles time. Apache Kafka offers many ways to handle time on the storage layer, i.e., the brokers, allowing users to build applications with different semantics. Time semantics in the processing layer, i.e., Kafka Streams and KSQL, are even richer and more powerful, but also more complicated. Hence, it is paramount for developers to understand the different time semantics and to know how to configure Kafka to achieve them. This talk enables developers to design applications with their desired time semantics, helps them reason about the runtime behavior with regard to time, and allows them to understand processing/query results.
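To make the configuration side concrete, here is a minimal, hypothetical Kafka Streams setup contrasting the default event-time behavior with processing-time semantics; the app id and broker address are placeholders:

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.processor.WallclockTimestampExtractor;

public class TimeSemanticsSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "time-demo");          // placeholder
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
    // Default: event time, i.e., the timestamp embedded in each record
    // (message.timestamp.type=CreateTime on the broker); with LogAppendTime
    // the broker overwrites it, giving ingestion-time semantics.
    // For processing-time semantics, plug in the wall-clock extractor instead:
    props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG,
        WallclockTimestampExtractor.class);
  }
}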
Exactly-Once Made Easy: Transactional Messaging Improvement for Usability and... - HostedbyConfluent
This talk is aimed at developers who are interested in scaling their streaming applications with Exactly-Once Semantics (EOS) guarantees. Since the original release, EOS processing has received wide adoption as a much-needed feature in the community, and has also exposed various scalability and usability issues when applied in production systems.
To address those issues, we improved on the existing EOS model by integrating static Producer transaction semantics with dynamic Consumer group semantics. We will take a deep dive into the newly added features (KIP-447), giving the audience more insight into the tradeoffs between scalability and semantic guarantees, and into how Kafka Streams specifically leveraged them to help scale EOS streaming applications written in this library. We will also present how the EOS code can be simplified with the plain Producer and Consumer. Come learn more if you wish to adopt this improved EOS feature and get started on building your own EOS application today!
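For orientation, enabling the KIP-447 model in Kafka Streams comes down to one config value; the sketch below uses placeholder ids, and note that the value was called exactly_once_beta in Kafka 2.6 before being renamed exactly_once_v2 in 3.0:

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class EosV2Sketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "eos-v2-app");         // placeholder
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
    // KIP-447 model: one transactional producer per client instead of one per
    // input partition, with zombie fencing done via the consumer group.
    props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
  }
}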
Real Time Streaming Data with Kafka and TensorFlow (Yong Tang, MobileIron) Ka... - confluent
In mission-critical real-time applications, using machine learning to analyze streaming data is gaining momentum. In those applications, Apache Kafka is the most widely used framework to process the data streams. It typically works with other machine learning frameworks for model inference and training purposes. In this talk, our focus is the KafkaDataset module in TensorFlow. KafkaDataset feeds Kafka streaming data directly into TensorFlow’s graph. As part of TensorFlow (in ‘tf.contrib’), the implementation of KafkaDataset is mostly written in C++. The module exposes a machine-learning-friendly Python interface through TensorFlow’s ‘tf.data’ API, and can be fed directly to ‘tf.keras’ and other TensorFlow modules for training and inference purposes. Combined with Kafka streaming itself, the KafkaDataset module in TensorFlow removes the need for an intermediate data processing infrastructure. This helps many mission-critical real-time applications adopt machine learning more easily. At the end of the talk we will walk through a concrete example with a demo to showcase the usage we described.
Achieving a 50% Reduction in Cross-AZ Network Costs from Kafka (Uday Sagar Si... - confluent
Cloud providers like AWS allow free data transfers within an Availability Zone (AZ), but bill users when data moves between AZs. When the data volume streamed through Kafka reaches big-data scale (e.g., numeric data points or user activity tracking), the costs incurred by cross-AZ traffic can add significantly to your monthly cloud spend. Since Kafka serves reads and writes only from leader partitions, for a topic with a replication factor of 3, a message sent through Kafka can cross AZs up to 4 times: once when a producer produces the message to a broker in a different AZ, twice during Kafka replication, and once more during message consumption. With careful design, we can eliminate the first and last parts of the cross-AZ traffic. We can also use the message compression strategies provided by Kafka to reduce costs during replication. In this talk, we will discuss the architectural choices that allow us to ensure a Kafka message is produced and consumed within a single AZ, as well as an algorithm that lets consumers intelligently subscribe to partitions with leaders in the same AZ. We will also cover use cases in which cross-AZ message streaming is unavoidable due to design limitations. Talk outline: 1) A review of Kafka replication, 2) Cross-AZ traffic implications, 3) Architectural choices for AZ-aware message streaming, 4) Algorithms for AZ-aware producers and consumers, 5) Results, 6) Limitations, 7) Takeaways.
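The talk describes a custom AZ-aware assignment algorithm; as a point of comparison, Kafka 2.4+ ships a built-in mechanism, follower fetching (KIP-392), that addresses the consumption leg of the same cost problem. A minimal sketch, with placeholder group, rack, and broker values:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class RackAwareConsumerSketch {
  public static void main(String[] args) {
    // Broker side (server.properties), not shown here:
    //   broker.rack=us-east-1a
    //   replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "az-aware-group");          // placeholder
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);
    // Advertise this consumer's AZ; the broker can then serve its fetches from
    // an in-sync replica in the same rack instead of a cross-AZ leader.
    props.put(ConsumerConfig.CLIENT_RACK_CONFIG, "us-east-1a");           // placeholder
    KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props);
    consumer.close();
  }
}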
Building Scalable and Extendable Data Pipeline for Call of Duty Games: Lesson... - Yaroslav Tkachenko
What can be easier than building a data pipeline nowadays? You add a few Apache Kafka clusters, some way to ingest data (probably over HTTP), design a way to route your data streams, add a few stream processors and consumers, integrate with a data warehouse... wait, it does start to look like A LOT of things, doesn't it? And you probably want to make it highly scalable and available in the end, correct?
We've been developing a data pipeline in Demonware/Activision for a while. We learned how to scale it not only in terms of messages/sec it can handle, but also in terms of supporting more games and more use-cases.
In this presentation you'll hear about the lessons we learned, including (but not limited to):
- Message schemas
- Apache Kafka organization and tuning
- Topics naming conventions, structure and routing
- Reliable and scalable producers and ingestion layer
- Stream processing
(Bill Bejeck, Confluent) Kafka Summit SF 2018
Apache Kafka added a powerful stream processing library in mid-2016, Kafka Streams, which runs on top of Apache Kafka. The community has embraced Kafka Streams with many early adopters, and the adoption rate continues to grow. Large to mid-size organizations have come to rely on Kafka Streams in their production environments. Kafka Streams has many advanced features to make applications more robust.
The point of this presentation is to show users of Kafka Streams some of the latest and greatest features, as well as some more advanced ones, that can make streams applications more resilient. The target audience for this talk is users who are already comfortable writing Kafka Streams applications and want to go from writing their first proof-of-concept applications to writing robust applications that can withstand the rigor of running in a production environment.
The talk will be a technical deep dive covering topics like:
-Best practices on configuring a Kafka Streams application
-How to meet production SLAs by minimizing failover and recovery times: configuring standby tasks and the pros and cons of having standby replicas for local state
-How to improve resiliency and 24×7 operability: the use of different configurable error handlers, callbacks and how they can be used to see what’s going on inside the application
-How to achieve efficient scalability: a thorough review of the relationship between the number of instances, threads and state stores and how they relate to each other
While this is a technical deep dive, the talk will also present sample code so that attendees can see the concepts discussed in practice. Attendees of this talk will walk away with a deeper understanding of how Kafka Streams works, and how to make their Kafka Streams applications more robust and efficient. The session will mix discussion with concrete examples.
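As a hedged illustration of the standby-task configuration mentioned above (the values and ids here are placeholders, not recommendations):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class ResilienceSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "prod-app");           // placeholder
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
    // Keep a warm copy of each task's local state on another instance, so a
    // failover restores from a near-current replica instead of replaying the
    // full changelog (faster recovery, with extra disk and network as the cost).
    props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
    // Scale within an instance by running more stream threads.
    props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);
  }
}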
Deep Dive Into Kafka Streams (and the Distributed Stream Processing Engine) (... - confluent
Kafka Streams is a library for developing applications that process records from topics in Apache Kafka. It provides the high-level Streams DSL and the low-level Processor API for describing fault-tolerant distributed streaming pipelines in the Java or Scala programming languages. Kafka Streams also offers an elaborate API for stateless and stateful stream processing. That’s a high-level view of Kafka Streams. Have you ever wondered how Kafka Streams does all this and what its relationship with Apache Kafka (the brokers) is? That’s among the topics of this talk.
During this talk we will look under the covers of Kafka Streams and deep dive into its fault-tolerant distributed stream processing engine. You will learn the roles of StreamThreads, TaskManager, StreamTasks, StandbyTasks, StreamsPartitionAssignor, RebalanceListener and a few others. The aim of this talk is to equip you with knowledge about the internals of Kafka Streams that should help you fine-tune your stream processing pipelines for better performance.
Jay Kreps is a Principal Staff Engineer at LinkedIn where he is the lead architect for online data infrastructure. He is among the original authors of several open source projects including a distributed key-value store called Project Voldemort, a messaging system called Kafka, and a stream processing system called Samza. This talk gives an introduction to Apache Kafka, a distributed messaging system. It will cover both how Kafka works, as well as how it is used at LinkedIn for log aggregation, messaging, ETL, and real-time stream processing.
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming - Guozhang Wang
Spark Streaming makes it easy to build scalable, robust stream processing applications — but only once you’ve made your data accessible to the framework. Spark Streaming solves the real-time data processing problem, but to build large-scale data pipelines we need to combine it with another tool that addresses data integration challenges. The Apache Kafka project recently introduced a new tool, Kafka Connect, to make data import/export to and from Kafka easier.
Building Stream Infrastructure across Multiple Data Centers with Apache Kafka - Guozhang Wang
To manage the ever-increasing volume and velocity of data within your company, you have successfully made the transition from single machines and one-off solutions to large distributed stream infrastructures in your data center, powered by Apache Kafka. But what if one data center is not enough? I will describe building resilient data pipelines with Apache Kafka that span multiple data centers and points of presence, and provide an overview of best practices and common patterns while covering key areas such as architecture guidelines, data replication, and mirroring as well as disaster scenarios and failure handling.
Designing Structured Streaming Pipelines—How to Architect Things Right - Databricks
"Structured Streaming has proven to be the best platform for building distributed stream processing applications. Its unified SQL/Dataset/DataFrame APIs and Spark's built-in functions make it easy for developers to express complex computations. However, expressing the business logic is only part of the larger problem of building end-to-end streaming pipelines that interact with a complex ecosystem of storage systems and workloads. It is important for the developer to truly understand the business problem needs to be solved.
What are you trying to consume? Single source? Joining multiple streaming sources? Joining streaming with static data?
What are you trying to produce? What is the final output that the business wants? What type of queries does the business want to run on the final output?
When do you want it? When does the business want the data? What is the acceptable latency? Do you really need millisecond-level latency?
How much are you willing to pay for it? This is the ultimate question, and the answer significantly determines how feasible it is to solve the above questions.
These are the questions that we ask every customer in order to help them design their pipeline. In this talk, I am going to go through the decision tree of designing the right architecture for solving your problem.
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo... - Guido Schmutz
Independent of the source of data, the integration and analysis of event streams is getting more important in a world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analyzed, often with many consumers or systems interested in all or part of the events. In this session we compare two popular streaming analytics solutions: Spark Streaming and Kafka Streams.
Spark is a fast and general engine for large-scale data processing that has been designed to provide a more efficient alternative to Hadoop MapReduce. Spark Streaming brings Spark's language-integrated API to stream processing, letting you write streaming applications the same way you write batch jobs. It supports both Java and Scala.
Kafka Streams is the stream processing solution that is part of Kafka. It is provided as a Java library and can therefore be easily integrated with any Java application.
Writing Continuous Applications with Structured Streaming Python APIs in Apac... - Databricks
Abstract:
We are amidst the Big Data Zeitgeist era, in which data comes at us fast, in myriad forms and formats, at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created the notion of writing a streaming application that is continuous and reacts and interacts with data in real time. We call this a continuous application.
In this talk we will explore the concepts and motivations behind the continuous application, how the Structured Streaming Python APIs in Apache Spark 2.x enable writing continuous applications, examine the programming model behind Structured Streaming, and look at the APIs that support it.
Through a short demo and code examples, I will demonstrate how to write an end-to-end Structured Streaming application that reacts and interacts with both real-time and historical data to perform advanced analytics using Spark SQL, DataFrames and Datasets APIs.
You’ll walk away with an understanding of what’s a continuous application, appreciate the easy-to-use Structured Streaming APIs, and why Structured Streaming in Apache Spark 2.x is a step forward in developing new kinds of streaming applications.
Introduction to Apache Kafka, Confluent and Why They Matter - Paolo Castagna
This is a short, introductory presentation on Apache Kafka (including the Kafka Connect and Kafka Streams APIs, both part of Apache Kafka) and other open source components that are part of the Confluent platform (such as KSQL).
This was the first Kafka Meetup in South Africa.
Spark (Structured) Streaming vs. Kafka Streams - Guido Schmutz
This presentation shows how you can implement stream processing solutions with each of the two frameworks, discusses how they compare and highlights the differences and similarities.
Stream Analytics with SQL on Apache Flink - Fabian Hueske
Apache Flink's DataStream API is very expressive and gives users precise control over time and state. However, many applications do not require this level of expressiveness and can be implemented more concisely and easily with a domain-specific API.
SQL is undoubtedly the most widely used language for data processing, but it is usually applied in the domain of batch processing. Apache Flink features two relational APIs for unified stream and batch processing: the Table API, a language-integrated relational query API for Scala and Java, and SQL. A Table API or SQL query computes the same result regardless of whether it is evaluated on a static file or on a Kafka topic. While Flink evaluates queries on batch input like a conventional query engine, queries on streaming input are continuously processed and their results constantly updated and refined.
In this talk we present Flink’s unified relational APIs, show how streaming SQL queries are processed, and discuss exciting new use-cases.
Kafka Connect and Streams (Concepts, Architecture, Features) - Kai Wähner
A high-level introduction to Kafka Connect and Kafka Streams, two components of the Apache Kafka open source framework. See the concepts, architecture and features.
The presentation explains the reasons we picked Kafka as our streaming hub and the use of Kafka Streams to avoid common anti-patterns, streamline the development experience, improve resilience, enhance performance and enable experimentation. A step-by-step example will be presented to introduce the Kafka Streams DSL and understand what happens under the hood of a stateful streaming application.
Enabling Data Scientists to easily create and own Kafka Consumers | Stefan Kr... - HostedbyConfluent
At Stitch Fix, we hire Full Stack Data Scientists (150+) and expect them to perform diverse functions: from conception to modeling to implementation to measurement. Since Kafka is the way we get event data, this inevitably means that a Data Scientist will need to write a Kafka consumer to complete their implementation work, e.g., to transform some client data into features, perform a model prediction, or allocate someone to an A/B test. In this talk I’ll go over how we built an opinionated Kafka client that enables Data Scientists to easily deploy and own production Kafka consumers by focusing on writing Python functions rather than fighting pitfalls with Kafka.
What is Apache Kafka and What is an Event Streaming Platform? - confluent
Speaker: Gabriel Schenker, Lead Curriculum Developer, Confluent
Streaming platforms have emerged as a popular new trend, but what exactly is a streaming platform? Part messaging system, part Hadoop made fast, part fast ETL and scalable data integration. With Apache Kafka® at the core, event streaming platforms offer an entirely new perspective on managing the flow of data. This talk will explain what an event streaming platform such as Apache Kafka is, along with some of the use cases and design patterns around its use, including several examples of where it is solving real business problems. New developments in this area, such as KSQL, will also be discussed.
Spark Streaming State of the Union - Strata San Jose 2015 - Databricks
The lead developer of the Apache Spark Streaming library at Databricks, Tathagata "TD" Das, provides an overview of Spark Streaming and previews what's to come.
Big Data LDN 2018: STREAMING DATA MICROSERVICES WITH AKKA STREAMS, KAFKA STRE... - Matt Stubbs
Date: 13th November 2018
Location: Fast Data Theatre
Time: 15:50 - 16:20
Speaker: Dean Wampler
Organisation: Lightbend
About: What if you used microservices for streaming data processing, rather than systems like Spark? I'll examine Kafka-based, microservice applications that use Akka Streams and Kafka Streams libraries for stream processing. I'll discuss the strengths and weaknesses of each tool for particular design needs, with lessons that are applicable to other library choices, too. I'll also contrast them with Spark Streaming and Flink; when should you choose them instead?
Author: Stefan Papp, Data Architect at "The unbelievable Machine Company". An overview of Big Data processing engines with a focus on Apache Spark and Apache Flink, given at a Vienna Data Science Group meeting on 26 January 2017. The following questions are addressed:
• What are big data processing paradigms and how do Spark 1.x/Spark 2.x and Apache Flink solve them?
• When should you use batch processing, and when stream processing?
• What is a Lambda architecture and what is a Kappa architecture?
• What are the best practices for your project?
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel... - Dan Halperin
Apache Beam (incubating) is a unified batch and streaming data processing programming model that is efficient and portable. Beam evolved from a decade of system-building at Google, and Beam pipelines run today on both open source (Apache Flink, Apache Spark) and proprietary (Google Cloud Dataflow) runners. This talk will focus on I/O and connectors in Apache Beam, specifically its APIs for efficient, parallel, adaptive I/O. We will discuss how these APIs enable a Beam pipeline runner to dynamically rebalance work at runtime, to work around stragglers, and to automatically scale cluster size up and down as a job's workload changes. Together, these APIs and techniques enable Apache Beam runners to use computing resources efficiently without compromising performance or correctness. Practical examples and a demonstration of Beam will be included.
6. What is Kafka, Really?
a scalable Pub-sub messaging system.. [NetDB 2011]
a real-time data pipeline.. [Hadoop Summit 2013]
a distributed and replicated log.. [VLDB 2015]
7. Example: Data Store Geo-Replication
[Diagram: two regions, each running user apps over local stores and Apache Kafka. Writes are appended to the local Kafka log, mirrored to the other region, and the log is applied to the remote local stores; reads are served locally.]
8. What is Kafka, Really?
a scalable Pub-sub messaging system.. [NetDB 2011]
a real-time data pipeline.. [Hadoop Summit 2013]
a distributed and replicated log.. [VLDB 2015]
a unified data integration stack.. [CIDR 2015]
11. Which of the following is true?
a scalable Pub-sub messaging system.. [NetDB 2011]
a real-time data pipeline.. [Hadoop Summit 2013]
a distributed and replicated log.. [VLDB 2015]
a unified data integration stack.. [CIDR 2015]
All of them!
12. Kafka: Streaming Platform
• Publish / Subscribe: move data around as online streams
• Store: “source-of-truth” continuous data
• Process: react to / process data in real-time
13–18. [Build slides: the same “Kafka: Streaming Platform” diagram, progressively adding Producers and Consumers (publish/subscribe), Connect (data integration), and Streams (processing). Slide 18 highlights Streams as the focus of today’s talk.]
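To ground the publish/subscribe layer of the diagram, here is a minimal producer and consumer pair in plain Java; the topic, group id, and broker address are placeholders:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class PubSubSketch {
  public static void main(String[] args) {
    // Producer side: publish one record to a topic.
    Properties p = new Properties();
    p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
    p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
      producer.send(new ProducerRecord<>("events", "key", "value")); // placeholder topic
    }
    // Consumer side: subscribe to the same topic and poll for records.
    Properties c = new Properties();
    c.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
    c.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");              // placeholder
    c.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    c.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
      consumer.subscribe(Collections.singletonList("events"));
      for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1)))
        System.out.println(record.key() + " -> " + record.value());
    }
  }
}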
19. Stream Processing
• A different programming paradigm
• .. that brings computation to unbounded data
• .. with tradeoffs between latency / cost / correctness
28. Kafka Streams DSL

A sketch of the slide's (0.10-era) word-count topology, with the builder, config, and serde declarations filled in so it compiles; the topic and store names are the slide's own:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;
import org.apache.kafka.streams.kstream.KTable;
public class WordCount {
  public static void main(String[] args) {
    // minimal config: app id, brokers, and default string serdes
    Properties config = new Properties();
    config.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount");
    config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    config.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    config.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    // specify the processing topology by first reading in a stream from a topic
    KStreamBuilder builder = new KStreamBuilder();
    KStream<String, String> words = builder.stream("topic1");
    // count occurrences of each key in this stream as an aggregated table
    KTable<String, Long> counts = words.countByKey("Counts");
    // write the result table to a new topic (keys are strings, values are longs)
    counts.to(Serdes.String(), Serdes.Long(), "topic2");
    // create a stream processing instance and start running it
    KafkaStreams streams = new KafkaStreams(builder, config);
    streams.start();
  }
}
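The countByKey()/KStreamBuilder API shown on this slide dates from 0.10 and was later removed; for reference, a sketch of the same topology against the current StreamsBuilder API (same topic and store names, placeholder app id and brokers):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class ModernWordCount {
  public static void main(String[] args) {
    Properties config = new Properties();
    config.put(StreamsConfig.APPLICATION_ID_CONFIG, "counts-app");        // placeholder
    config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
    config.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    config.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> words = builder.stream("topic1");
    // countByKey("Counts") became groupByKey().count(Materialized.as("Counts"))
    KTable<String, Long> counts = words.groupByKey().count(Materialized.as("Counts"));
    // a KTable is written out via its changelog stream
    counts.toStream().to("topic2", Produced.with(Serdes.String(), Serdes.Long()));

    new KafkaStreams(builder.build(), config).start();
  }
}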
66. The Stream-Table Duality
• A stream is a changelog of a table
• A table is a materialized view, at a point in time, of a stream
• Example: change data capture (CDC) of databases
67. KStream = data interpreted as a record stream (think: “append-only”)
KTable = data interpreted as a changelog stream (a continuously updated materialized view)
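A small sketch of the duality in code: the same builder can interpret a topic as a record stream or as a changelog, depending on how it is read (the topic names here are hypothetical):

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class DualitySketch {
  public static void main(String[] args) {
    StreamsBuilder builder = new StreamsBuilder();
    // As a KStream, every record is an independent fact:
    // (alice, eggs) then (alice, milk) means two purchases.
    KStream<String, String> purchases = builder.stream("purchases"); // placeholder topic
    // As a KTable, a later record for the same key replaces the earlier one:
    // (alice, lnkd) then (alice, msft) means Alice's latest employer is msft.
    KTable<String, String> profiles = builder.table("profiles");     // placeholder topic
  }
}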
68–70. [Example: the records alice:eggs, bob:lettuce, alice:milk read as a KStream form a user purchase history, where each record is a fact (“Alice bought eggs and milk”). The records alice:lnkd, bob:googl, alice:msft read as a KTable form a user employment profile, where each record is an update (“Alice is now at Microsoft”, the latest value winning).]
71–72. [Example: for the input records (alice, 2), (bob, 10), (alice, 3) over time, KStream.aggregate() treats each record as an independent fact and accumulates, so Alice’s aggregate becomes 2 + 3. KTable.aggregate() treats the later record as an update that replaces the earlier one, so Alice’s value becomes 3.]
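In the current DSL this distinction surfaces as KGroupedStream versus KGroupedTable: a table aggregation takes both an adder and a subtractor so an update can first retract the old value. A sketch with hypothetical topics carrying (user, score) records:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;

public class AggregationSemanticsSketch {
  public static void main(String[] args) {
    StreamsBuilder builder = new StreamsBuilder();
    // Stream aggregation accumulates: (alice, 2) then (alice, 3) sums to 5.
    KTable<String, Long> fromStream = builder
        .stream("scores-stream", Consumed.with(Serdes.String(), Serdes.Long())) // placeholder
        .groupByKey()
        .reduce(Long::sum);
    // Table aggregation updates: the subtractor first retracts the old value,
    // so (alice, 2) then (alice, 3) yields 3, not 5.
    KTable<String, Long> fromTable = builder
        .table("scores-table", Consumed.with(Serdes.String(), Serdes.Long()))   // placeholder
        .groupBy(KeyValue::pair, Grouped.with(Serdes.String(), Serdes.Long()))
        .reduce(Long::sum, (aggregate, oldValue) -> aggregate - oldValue);
  }
}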
100. Stream Processing Hard Parts
• Ordering
• Partitioning & Scalability
• Fault tolerance
• State Management
• Time, Window & Out-of-order Data
• Re-processing
For more details: http://docs.confluent.io/current
101. [The same hard-parts list again, now with the message: Simple is Beautiful.]
102. Ongoing Work (0.10.1+)
• Beyond Java APIs
• SQL support, Python client, etc.
• End-to-End Semantics (exactly-once)
• … and more
103. Take-aways
• Apache Kafka: a centralized streaming platform
• Kafka Streams: stream processing made easy