Running large scale Kafka upgrades at Yelp (Manpreet Singh, Yelp), Kafka Summit... | confluent
Over the years at Yelp, we have relied on Kafka to build many complex applications and stream-processing data pipelines that solve a multitude of use cases, including powering our product experimentation workflow, search indexing, asynchronous task processing and more. Today, Kafka is at the core of our infrastructure. These applications use different versions of Kafka clients and different programming languages. To fulfill the requirements of these diverse use cases, we run several specialized Kafka clusters for high availability, consistency, exactly-once delivery and infinite retention. We endeavor to keep our clusters up to date with newer Kafka versions, which bring several critical bug fixes and exciting features like dynamic broker configuration, exactly-once semantics, Kafka offset management and improved tooling. Our journey with Kafka started with version 0.8.2.0. Upgrading Kafka while ensuring client compatibility, zero downtime and negligible performance degradation across our ever-growing multi-regional cluster deployment exposed us to a plethora of unique challenges. This session will focus on the challenges we encountered and how we evolved our infrastructure tooling and upgrade strategy to overcome them. I will be talking about:
- How we rolled out new features such as Kafka offset storage, message timestamps, reassignment auto-throttling, etc.
- Core technical issues discovered during upgrades, such as log cleaner failures caused by large offsets.
- The in-house test suite that we built to validate new Kafka versions against our existing tooling and client libraries, exercise the upgrade and rollback process, and benchmark performance.
- The automation we built for safe and fast rolling upgrades and broker configuration deployment.
Upgrades suck. We get it. They are risky and time-consuming, and you have better things to do. In this talk we'll present good reasons to upgrade anyway and give suggestions on how to de-risk your upgrades, straight from the team that upgrades Kafka almost every week. We'll review all the releases in the past year (major, minor and bug-fix), explain the differences between them and what you can expect from each. We'll go into the most important features and the most critical fixes and improvements, so you'll have ample ammunition when you explain to your boss why you really have to upgrade Kafka. Then we'll discuss how we validate new releases and suggest a safe upgrade process, because we know that uneventful upgrades are the key to the next upgrade.
Design and Implementation of Incremental Cooperative Rebalancing | confluent
Watch this talk here: https://www.confluent.io/online-talks/design-and-implementation-of-incremental-cooperative-rebalancing-on-demand
Since its initial release, the Kafka group membership protocol has offered Connect, Streams and Consumer applications an ingenious and robust way to balance resources among distributed processes. The process of rebalancing, as it’s widely known, allows Kafka APIs to define an embedded protocol for load balancing within the group membership protocol itself.
Until now, rebalancing has been working under the simple assumption that every time a new group generation is created, the members join after first releasing all of their resources, getting a whole new load assignment by the time the new group is formed. This allows Kafka APIs to provide task fault-tolerance and elasticity on top of the group membership protocol.
However, due to its side effects on multi-tenancy and scalability, this simple approach to rebalancing, also known as the stop-the-world effect, limits larger-scale deployments. Because of stop-the-world, application tasks get interrupted, only for most of them to receive the same resources back after rebalancing. In this technical deep dive, we'll discuss Incremental Cooperative Rebalancing as a way to alleviate stop-the-world and optimize rebalancing in the Kafka APIs.
This talk will cover:
- The internals of Incremental Cooperative Rebalancing
- Use cases that benefit from Incremental Cooperative Rebalancing
- Implementation in Kafka Connect
- Performance results in Kafka Connect clusters
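The talk itself covers Kafka Connect's implementation, but the same protocol idea later shipped for plain consumers as the CooperativeStickyAssignor (KIP-429). A minimal sketch of opting in, assuming a local broker and hypothetical topic/group names:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CooperativeConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption: local broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");              // hypothetical group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        // Opt in to incremental cooperative rebalancing: members keep their
        // partitions across a rebalance and only revoke the ones that move.
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                  CooperativeStickyAssignor.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("demo-topic")); // hypothetical topic
            // poll loop elided
        }
    }
}
```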
Production Ready Kafka on Kubernetes (Devendra Tagare, Lyft), Kafka Summit SF... | confluent
Getting Kafka running on Kubernetes is only step one of a journey to create a production-ready Kafka cluster. This talk walks through the other steps: 1) Monitoring and remediating faults. 2) Updates to Kubernetes nodes for clusters not using shared storage. 3) Automating Kafka updates and restarts. We present how to create fault-tolerant Kafka clusters on Kubernetes without sacrificing availability, durability, or latency. Learn about Lyft's overlay-free Kubernetes networking driver and how we use it to keep performance on par with non-Kubernetes clusters.
Can Kafka Handle a Lyft Ride? (Andrey Falko & Can Cecen, Lyft), Kafka Summit 2020 | Hosted by Confluent
What does a Kafka administrator need to do if they have a user who demands that message delivery be guaranteed, fast, and low cost? In this talk we walk through the architecture we created to deliver for such users. Learn about the alternatives we considered and the pros and cons of the design we settled on.
In this talk, we’ll be forced to dive into broker restart and failure scenarios and the things we need to do to prevent leader elections from slowing down incoming requests. We’ll need to take care of the consumers as well to ensure that they don’t process the same request twice. We also plan to describe our architecture with a demo of simulated requests being produced into Kafka clusters and consumers processing them, while we aggressively cause failures on the Kafka clusters.
We hope the audience walks away with a deeper understanding of what it takes to build robust Kafka clients and how to tune them to accomplish stringent delivery guarantees.
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015 | Monal Daxini
Keystone: processing over half a trillion events per day, with peaks of 8 million events and 17 GB per second, and at-least-once processing semantics. We will explore in detail how we employ Kafka, Samza, and Docker at scale to implement a multi-tenant pipeline. We will also look at the evolution to its current state and where the pipeline is headed next in offering a self-service stream processing infrastructure atop the Kafka-based pipeline, with support for Spark Streaming.
What's the time? ...and why? (Matthias Sax, Confluent), Kafka Summit SF 2019 | confluent
Data stream processing is built on the core concept of time. However, understanding time semantics and reasoning about time is not simple, especially if deterministic processing is expected. In this talk, we explain the difference between processing, ingestion, and event time and what their impact is on data stream processing. Furthermore, we explain how Kafka clusters and stream processing applications must be configured to achieve specific time semantics. Finally, we deep dive into the time semantics of the Kafka Streams DSL and KSQL operators, and explain in detail how the runtime handles time. Apache Kafka offers many ways to handle time on the storage layer, i.e., the brokers, allowing users to build applications with different semantics. Time semantics in the processing layer, i.e., Kafka Streams and KSQL, are even richer and more powerful, but also more complicated. Hence, it is paramount for developers to understand the different time semantics and to know how to configure Kafka to achieve them. This talk therefore enables developers to design applications with their desired time semantics, helps them reason about runtime behavior with regard to time, and allows them to understand processing/query results.
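As a concrete illustration of the processing-layer side (not taken from the talk itself), here is a minimal sketch of a custom Kafka Streams TimestampExtractor that reads event time from the record payload and falls back to the record's own timestamp; the MyEvent payload type is invented for illustration:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class PayloadTimestampExtractor implements TimestampExtractor {
    // Hypothetical payload type carrying its own event time.
    public record MyEvent(long eventTimeMs, String body) {}

    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        // Event time: taken from the payload itself, if present.
        if (record.value() instanceof MyEvent event) {
            return event.eventTimeMs();
        }
        // Fallback: the record's own timestamp, which is CreateTime (producer clock)
        // or LogAppendTime (broker clock), depending on message.timestamp.type.
        return record.timestamp();
    }
}
```

The extractor would be wired in via the Streams config default.timestamp.extractor; the storage-layer counterpart the abstract mentions is the broker/topic setting message.timestamp.type.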
Kafka Streams: Revisiting the decisions of the past (How I could have made it better) | Jason Bell, Kafka DevOps Engineer @ Digitalis.io | confluent
https://www.meetup.com/Cleveland-Kafka/events/272339276/
Discover Kafka on OpenShift: Processing Real-Time Financial Events at Scale | confluent
"To provide exceptional customer experiences at scale, the data pipelines that can move data reliably across the systems and applications in real-time should be seamlessly scalable. For the past several years, we relied on Message Queue based data pipelines to facilitate the transfer of data across the applications. However, as the number of use cases that require real-time data transfer increased rapidly, it became difficult to scale the messaging platform. Moving to Kafka helped us to resolve the data pipeline scaling issues and reduce the Publisher/Subscriber on-boarding time from several weeks to a few days. To support the on-demand scaling of Kafka clusters, we run them on RedHat OpenShift, an Enterprise Kubernetes. While managing Kafka that handles critical financial events, we have learned some lessons and developed efficient strategies to manage production-grade Kafka clusters on OpenShift. In this talk, we will present:
1. Some of the challenges that we faced with Kafka on OpenShift and how we evolved our infrastructure to overcome them.
2. Share our experiences from operating Kafka clusters at Scale in Production.
3. Our strategy for performing automated Kafka deployment and rollback in OpenShift.
4. Explain our fail-over strategy using Confluent’s Replicator to ensure service availability during cluster failures."
Consistency and Completeness: Rethinking Distributed Stream Processing in Apache Kafka | Guozhang Wang
We present Apache Kafka’s core design for stream processing, which relies on its persistent log architecture as the storage and inter-processor communication layers to achieve correctness guarantees. Kafka Streams, a scalable stream processing client library in Apache Kafka, defines the processing logic as read-process-write cycles in which all processing state updates and result outputs are captured as log appends. Idempotent and transactional write protocols are utilized to guarantee exactly-once semantics. Furthermore, revision-based speculative processing is employed to emit results as soon as possible while handling out-of-order data. We also demonstrate how Kafka Streams behaves in practice with large-scale deployments and performance insights exhibiting its flexible and low-overhead trade-offs.
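For readers who want to see what those write protocols look like at the API level, here is a minimal sketch of Kafka's transactional producer, the same primitive Kafka Streams builds on for exactly-once; the broker address, topics, and transactional id are assumptions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.errors.AuthorizationException;
import org.apache.kafka.common.errors.OutOfOrderSequenceException;
import org.apache.kafka.common.errors.ProducerFencedException;

public class TransactionalWriter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "demo-txn-1");      // hypothetical id
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions(); // fences any older producer with the same transactional.id
        try {
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("output-a", "k", "result-1")); // hypothetical topics
            producer.send(new ProducerRecord<>("output-b", "k", "result-2"));
            producer.commitTransaction(); // both records become visible atomically
        } catch (ProducerFencedException | OutOfOrderSequenceException | AuthorizationException e) {
            // Fatal errors: this producer instance must be abandoned.
        } catch (KafkaException e) {
            producer.abortTransaction(); // read_committed consumers never see the writes
        }
        producer.close();
    }
}
```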
Recently, interest in highly scalable stream processing engines has risen, and many projects have appeared. Apache Samza is a distributed stream-processing framework that uses Apache Kafka for messaging and Apache Hadoop YARN for fault tolerance and resource management. It is one of the most popular stream processing engines out there, used by many high-profile companies. On the other hand, Amazon Kinesis is a fully managed service for real-time processing of streaming data which allows users to scale the amount of data ingested without worrying about infrastructure details. This presentation gives a brief introduction to the very popular Samza-Kafka integration, then focuses on the new Samza-Kinesis integration and the new opportunities it opens up for users.
Better Kafka Performance Without Changing Any Code | Simon Ritter, Azul | Hosted by Confluent
Apache Kafka is the most popular open-source stream-processing software for collecting, processing, storing, and analyzing data at scale. Best known for its excellent performance, low latency, fault tolerance, and high throughput, it's capable of handling thousands of messages per second. For mission-critical applications, how do you ensure that the performance delivered is the performance required? This is especially important as Kafka is written in Java and Scala and runs on the JVM. The JVM is a fantastic platform that delivers at internet scale. In this session, we'll explore how making changes to the JVM design can eliminate the problems of garbage collection pauses and raise the throughput of applications. For cloud-based Kafka applications, this can deliver both lower latency and reduced infrastructure costs. All without changing a line of code!
Apache Kafka, Apache Cassandra and Kubernetes are open source big data technologies enabling applications and business operations to scale massively and rapidly. While Kafka and Cassandra underpin the data layer of the stack, providing the capability to stream, disseminate, store and retrieve data at very low latency, Kubernetes is a container orchestration technology that helps automate application deployment and scale application clusters. In this presentation, we will reveal how we architected a massive-scale deployment of a streaming data pipeline with Kafka and Cassandra to cater to an example anomaly detection application running on a Kubernetes cluster, generating and processing massive amounts of events. Anomaly detection is a method used to detect unusual events in an event stream. It is widely used in a range of applications such as financial fraud detection, security, threat detection, website user analytics, sensors, IoT, and system health monitoring. When such applications operate at massive scale, generating millions or billions of events, they impose significant computational, performance and scalability challenges on anomaly detection algorithms and data layer technologies. We will demonstrate the scalability, performance and cost effectiveness of Apache Kafka, Cassandra and Kubernetes, with results from our experiments allowing the anomaly detection application to scale to 19 billion anomaly checks per day.
Administrative techniques to reduce Kafka costs | Anna Kepler, Viasat | Hosted by Confluent
When your Kafka clusters start growing, so does the cost associated with them. As administrators, we have to ensure that the service we support operates in the most reliable way to satisfy our customers. However, for our business it is just as important that the same service is also cost-efficient. There are two ways to optimize the cost of the service: tuning the broker machines and tuning the data transfers. Minimizing data transfer is the largest return on investment, since that is what accounts for the most spend. With the use of Kafka administrative tools and metrics, we can find multiple ways to reduce the data transfers in the clusters.
The presentation will cover various techniques that administrators of a Kafka service can employ to reduce data transfers and save operational costs: reducing cross-AZ traffic, optimizing batching with the help of the DumpLogSegments tool, utilizing Kafka metrics to shut down unused data streams, and more.
With the objective of making our Kafka deployment as cost-effective as possible, we have gathered money-saving tricks, and we would love to share them with the community.
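One concrete technique in this family (follower fetching from KIP-392, which may or may not be among the talk's exact tricks) lets a consumer read from a replica in its own availability zone instead of a cross-AZ leader. A hedged sketch, with the matching broker settings shown as comments and all names invented:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class RackAwareConsumerConfig {
    public static Properties build() {
        // Broker side (server.properties), shown here as comments:
        //   broker.rack=us-east-1a
        //   replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // assumption
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "cost-aware-group");     // hypothetical group
        // Tell the broker which zone this consumer lives in so it can route
        // fetches to an in-zone follower instead of a cross-AZ leader.
        props.put(ConsumerConfig.CLIENT_RACK_CONFIG, "us-east-1a");
        return props;
    }
}
```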
Introducing Exactly Once Semantics To Apache Kafka | Apurva Mehta
Here are slides from my talk on introducing exactly once semantics to Apache Kafka. The talk was given at the Kafka Summit NYC, 8 May 2017.
The slides dive into the design of transactions in Apache Kafka.
On September 21, we had the pleasure of hosting in our offices a meetup given by our colleague Paco Guerrero on the Apache Flink platform.
"Apache Flink is an open source real-time processing platform that is on the rise, offering features that competing technologies lack, without any impact on performance. In this session we will introduce the philosophy and processing engine that make Flink so special and powerful. We will also walk through the basic pillars that establish Flink as the most promising streaming platform today."
The Foundations of Multi-DC Kafka (Jakub Korab, Solutions Architect, Confluent) | confluent
Kafka is notoriously tricky for multi-DC use cases. The log abstraction and client failover break down when you cannot at least guarantee offset consistency. In this talk, we define the current state of Kafka in terms of multi-DC usage, examine how different approaches provide different guarantees as well as the remaining gaps, and discuss how the community is addressing them.
Streaming in Practice - Putting Apache Kafka in Production | confluent
This presentation focuses on how to integrate Kafka and its surrounding components into an enterprise environment and what you need to consider as you move into production.
We will touch on the following topics:
- Patterns for integrating with existing data systems and applications
- Metadata management at enterprise scale
- Tradeoffs in performance, cost, availability and fault tolerance
- Choosing which cross-datacenter replication patterns fit with your application
- Considerations for operating Kafka-based data pipelines in production
Essential ingredients for real time stream processing @Scale by Kartik Paramasivam | Big Data Spain
At LinkedIn, we ingest more than 1 trillion events per day pertaining to user behavior, application and system health, and more into our pub-sub system (Kafka). Another source of events is the updates happening on our SQL and NoSQL databases. For example, every time a user changes their LinkedIn profile, a ton of downstream applications need to know what happened and react to it. We have a system (Databus) which listens to changes in the database transaction logs and makes them available for downstream processing. We process ~2.1 trillion such database change events per week.
We use Apache Samza for processing these event streams in real time. In this presentation we will discuss some of the challenges we faced and the various techniques we used to overcome them.
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.bigdataspain.org/program/thu/slot-3.html
The talk describes how Yelp deploys Zipkin and integrates it with its 250+ services. It also covers the challenges we faced while scaling it up and how we tuned it.
How to Improve the Observability of Apache Cassandra and Kafka applications...Paul Brebner
As distributed cloud applications grow more complex, dynamic, and massively scalable, “observability” becomes more critical.
Observability is the practice of using metrics, monitoring and distributed tracing to understand how a system works.
We’ll explore two complementary open source technologies:
- Prometheus for monitoring application metrics, and
- OpenTracing and Jaeger for distributed tracing.
We’ll discover how they improve the observability of an Anomaly Detection application, deployed on AWS Kubernetes and using Instaclustr-managed Apache Cassandra and Kafka clusters.
Low latency scalable web crawling on Apache Storm | Julien Nioche
In this talk I will introduce Storm-Crawler https://github.com/DigitalPebble/storm-crawler, a collection of resources for building low-latency, large scale web crawlers on Apache Storm. We will compare with similar projects like Apache Nutch and present several use cases where the storm-crawler is being used. In particular we will see how the Storm-crawler can be used with ElasticSearch and Kibana for crawling and indexing web pages.
A walk through the current state of stream processing, the key differentiators that make Samza stand out in the crowd, what's new in Samza, and what's coming next.
Foundations for Scaling ML in Apache Spark | Databricks
Apache Spark has become the most active open source Big Data project, and its machine learning library MLlib has seen rapid growth in usage. A critical aspect of MLlib and Spark is the ability to scale: the same code used on a laptop can scale to hundreds or thousands of machines. This talk will describe ongoing and future efforts to make MLlib even faster and more scalable by integrating with two key initiatives in Spark. The first is Catalyst, the query optimizer underlying DataFrames and Datasets. The second is Tungsten, the project for approaching bare-metal speeds in Spark via memory management, cache-awareness, and code generation. This talk will discuss the goals, the challenges, and the benefits for MLlib users and developers. More generally, we will reflect on the importance of integrating ML with the many other aspects of big data analysis.
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk | Spark Summit
Apache Spark MLlib provides scalable implementations of popular machine learning algorithms, which let users train models on big datasets and iterate fast. The existing implementations assume that the number of parameters is small enough to fit in the memory of a single machine. However, many applications require solving problems with billions of parameters on a huge amount of data, such as ads CTR prediction and deep neural networks. This requirement far exceeds the capacity of existing MLlib algorithms, many of which use L-BFGS as the underlying solver. In order to fill this gap, we developed Vector-free L-BFGS for MLlib. It can solve optimization problems with billions of parameters in the Spark SQL framework, where the training data are often generated. The algorithm scales very well and enables a variety of MLlib algorithms to handle a massive number of parameters over large datasets. In this talk, we will illustrate the power of Vector-free L-BFGS via logistic regression with a real-world dataset and requirements. We will also discuss how this approach could be applied to other ML algorithms.
Real Time Data Processing using Spark Streaming | Data Day Texas 2015 | Cloudera, Inc.
Speaker: Hari Shreedharan
Data Day Texas 2015
Apache Spark has emerged over the past year as the likely successor to Hadoop MapReduce. Spark can process data in memory at very high speed, while still being able to spill to disk if required. Spark’s powerful yet flexible API allows users to write complex applications very easily, without worrying about the internal workings or how the data gets processed on the cluster.
Spark comes with an extremely powerful Streaming API to process data as it is ingested. Spark Streaming integrates with popular data ingest systems like Apache Flume, Apache Kafka, and Amazon Kinesis, allowing users to process data as it comes in.
In this talk, Hari will discuss the basics of Spark Streaming, its API, and its integration with Flume, Kafka and Kinesis. Hari will also discuss a real-world example of a Spark Streaming application, and how code can be shared between a Spark application and a Spark Streaming application. Each stage of the application's execution will be presented, which can help in understanding good practices for writing such applications. Hari will finally discuss how to write a custom application and a custom receiver to receive data from other systems.
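As a minimal illustration of the Kafka integration discussed above, here is a sketch using the newer spark-streaming-kafka-0-10 direct-stream API (a later artifact than the 2015 talk); the broker, topic, and group names are assumptions:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class ClickCounter {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("ClickCounter").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        kafkaParams.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        kafkaParams.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        kafkaParams.put("group.id", "click-counter"); // hypothetical group

        JavaInputDStream<ConsumerRecord<String, String>> stream =
            KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(List.of("clicks"), kafkaParams));

        // Count records per micro-batch as a stand-in for real processing.
        stream.count().print();

        jssc.start();
        jssc.awaitTermination();
    }
}
```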
Overview of Apache Flink: Next-Gen Big Data Analytics Framework | Slim Baltagi
These are the slides of my talk on June 30, 2015 at the first event of the Chicago Apache Flink meetup. Although most of the current buzz is about Apache Spark, the talk shows how Apache Flink offers the only hybrid open source (Real-Time Streaming + Batch) distributed data processing engine supporting many use cases: Real-Time stream processing, machine learning at scale, graph analytics and batch processing.
In these slides, you will find answers to the following questions: What is the Apache Flink stack and how does it fit into the Big Data ecosystem? How does Apache Flink integrate with Apache Hadoop and other open source tools for data input and output, as well as deployment? What is the architecture of Apache Flink? What are the different execution modes of Apache Flink? Why is Apache Flink an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark? Who is using Apache Flink? Where can you learn more about Apache Flink?
Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your recovery story?" | Flink Forward
Failures are inevitable. How can we recover a Flink job from an outage? How do we reprocess data from the outage period? What are the implications for downstream consumers? These are important questions that we need to answer when running Flink for critical data processing applications. We implemented two solutions for our stream processing platform: (1) use a data warehouse, like Hive, as a backfill source; (2) rewind the Flink job using an external checkpoint. We will describe both solutions in detail and discuss the pros and cons of each approach. We will also take a look at some of the caveats to watch out for.
Samza at LinkedIn: Taking Stream Processing to the Next Level | Martin Kleppmann
Slides from my talk at Berlin Buzzwords, 27 May 2014. Unfortunately Slideshare has screwed up the fonts. See https://speakerdeck.com/ept/samza-at-linkedin-taking-stream-processing-to-the-next-level for a version of the deck with correct fonts.
Stream processing is an essential part of real-time data systems, such as news feeds, live search indexes, real-time analytics, metrics and monitoring. But writing stream processes is still hard, especially when you're dealing with so much data that you have to distribute it across multiple machines. How can you keep the system running smoothly, even when machines fail and bugs occur?
Apache Samza is a new framework for writing scalable stream processing jobs. Like Hadoop and MapReduce for batch processing, it takes care of the hard parts of running your message-processing code on a distributed infrastructure, so that you can concentrate on writing your application using simple APIs. It is in production use at LinkedIn.
This talk will introduce Samza, and show how to use it to solve a range of different problems. Samza has some unique features that make it especially interesting for large deployments, and in this talk we will dig into how they work under the hood. In particular:
• Samza is built to support many different jobs written by different teams. Isolation between jobs ensures that a single badly behaved job doesn't affect other jobs. It is robust by design.
• Samza can handle jobs that require large amounts of state, for example joining multiple streams, augmenting a stream with data from a database, or aggregating data over long time windows. This makes it a very powerful tool for applications.
The slides for the Stream Processing Meetup (7/19/2018): https://www.meetup.com/Stream-Processing-Meetup-LinkedIn/events/251481797/
This presentation introduces the newly developed Samza runner for Apache Beam. You will see the capabilities of the Samza runner and how it supports key Beam features. You will also see a few use cases and our future roadmap.
Netflix Keystone Pipeline at Samza Meetup 10-13-2015 | Monal Daxini
The Netflix Keystone Pipeline, processing 600 billion events a day: a detailed treatise on the modification and use of Samza for real-time routing of events, including Docker.
Stream processing in Python with Apache Samza and Beam | Hai Lu
Apache Samza is the streaming engine used at LinkedIn, processing around 2 trillion messages daily. A while back we announced Samza's integration with Apache Beam, a great success that led to our Samza Beam API. Now comes an upgrade of our APIs: we now support stream processing in Python! This work has made stream processing more accessible and enabled many interesting use cases, particularly in the area of machine learning. The Python API is based on our work on the Samza runner for Apache Beam. In this talk, we will quickly review our work on the Samza runner and then show how we extended it to support portability in Beam (Python specifically). In addition to technical and architectural details, we will also talk about how we bridged the Python and Java ecosystems at LinkedIn with the Python API, together with different use cases.
Exactly-Once Financial Data Processing at Scale with Flink and Pinot | Flink Forward
Flink Forward San Francisco 2022.
At Stripe we have created a complete end-to-end exactly-once processing pipeline to process financial data at scale, by combining the exactly-once power of Flink, Kafka, and Pinot. The pipeline provides an exactly-once guarantee, end-to-end latency within a minute, deduplication against hundreds of billions of keys, and sub-second query latency against the whole dataset with trillions of rows. In this session we will discuss the technical challenges of designing, optimizing, and operating the whole pipeline, including Flink, Kafka, and Pinot. We will also share our lessons learned and the benefits gained from exactly-once processing.
by Xiang Zhang, Pratyush Sharma & Xiaoman Dong
Overcoming Variable Payloads to Optimize for Performance | ScyllaDB
When you have a significant number of events coming in from individual customers but do not want to spend the majority of your time on latency issues, how do you optimize for performance? This becomes increasingly difficult when you are dealing with payload sizes that span multiple orders of magnitude, complex data that impacts processing, and a stream of data that is impossible to predict. In this session, you’ll hear from Armin Ronacher, Principal Architect at Sentry and creator of the Flask web framework for Python, on how to build ingestion and processing pipelines that accommodate complex events, helping to ensure your teams reach a throughput of hundreds of thousands of events per second.
The need for gleaning answers from unbounded data streams is moving from a nicety to a necessity. Netflix is a data-driven company and needs to process over 1 trillion events a day, amounting to 3 PB of data, to derive business insights.
To ease extracting insight, we are building a self-serve, scalable, fault-tolerant, multi-tenant "Stream Processing as a Service" platform so users can focus on data analysis. I'll share our experience using Flink to help build the platform.
http://www.oreilly.com/pub/e/3764
Keystone processes over 700 billion events per day (1 petabyte) with at-least-once processing semantics in the cloud. Monal Daxini details how they used Kafka, Samza, Docker, and Linux at scale to implement a multi-tenant pipeline in the AWS cloud within a year. He'll also share plans for offering Stream Processing as a Service for all of Netflix's use cases.
QCON 2015: Gearpump, Realtime Streaming on Akka | Sean Zhong
Gearpump is an Akka-based real-time streaming engine that uses Actors to model everything. It offers excellent performance and flexibility: 18,000,000 messages/second with a latency of 8 ms on a cluster of 4 machines.
Similar to Essential Ingredients of Realtime Stream Processing @ Scale (20)
Software Engineering, Software Consulting, Tech Lead, Spring Boot, Spring Cloud, Spring Core, Spring JDBC, Spring Transaction, Spring MVC, OpenShift Cloud Platform, Kafka, REST, SOAP, LLD & HLD.
We describe the deployment and use of Globus Compute for remote computation. This content is aimed at researchers who wish to compute on remote resources using a unified programming interface, as well as system administrators who will deploy and operate Globus Compute services on their research computing infrastructure.
Utilocate offers a comprehensive solution for locate ticket management by automating and streamlining the entire process. By integrating with Geospatial Information Systems (GIS), it provides accurate mapping and visualization of utility locations, enhancing decision-making and reducing the risk of errors. The system's advanced data analytics tools help identify trends, predict potential issues, and optimize resource allocation, making the locate ticket management process smarter and more efficient. Additionally, automated ticket management ensures consistency and reduces human error, while real-time notifications keep all relevant personnel informed and ready to respond promptly.
The system's ability to streamline workflows and automate ticket routing significantly reduces the time taken to process each ticket, making the process faster and more efficient. Mobile access allows field technicians to update ticket information on the go, ensuring that the latest information is always available and accelerating the locate process. Overall, Utilocate not only enhances the efficiency and accuracy of locate ticket management but also improves safety by minimizing the risk of utility damage through precise and timely locates.
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteGoogle
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
👉👉 Click Here To Get More Info 👇👇
https://sumonreview.com/ai-pilot-review/
AI Pilot Review: Key Features
✅Deploy AI expert bots in Any Niche With Just A Click
✅With one keyword, generate complete funnels, websites, landing pages, and more.
✅More than 85 AI features are included in the AI pilot.
✅No setup or configuration; use your voice (like Siri) to do whatever you want.
✅You Can Use AI Pilot To Create your version of AI Pilot And Charge People For It…
✅ZERO Manual Work With AI Pilot. Never write, Design, Or Code Again.
✅ZERO Limits On Features Or Usages
✅Use Our AI-powered Traffic To Get Hundreds Of Customers
✅No Complicated Setup: Get Up And Running In 2 Minutes
✅99.99% Up-Time Guaranteed
✅30 Days Money-Back Guarantee
✅ZERO Upfront Cost
See My Other Reviews Article:
(1) TubeTrivia AI Review: https://sumonreview.com/tubetrivia-ai-review
(2) SocioWave Review: https://sumonreview.com/sociowave-review
(3) AI Partner & Profit Review: https://sumonreview.com/ai-partner-profit-review
(4) AI Ebook Suite Review: https://sumonreview.com/ai-ebook-suite-review
Unleash Unlimited Potential with One-Time Purchase
BoxLang is more than just a language; it's a community. By choosing a Visionary License, you're not just investing in your success, you're actively contributing to the ongoing development and support of BoxLang.
AI Genie Review: World’s First Open AI WordPress Website CreatorGoogle
AI Genie Review: World’s First Open AI WordPress Website Creator
👉👉 Click Here To Get More Info 👇👇
https://sumonreview.com/ai-genie-review
AI Genie Review: Key Features
✅Creates Limitless Real-Time Unique Content, auto-publishing Posts, Pages & Images directly from Chat GPT & Open AI on WordPress in any Niche
✅First & Only Google Bard Approved Software That Publishes 100% Original, SEO Friendly Content using Open AI
✅Publish Automated Posts and Pages using AI Genie directly on Your website
✅50 DFY Websites Included Without Adding Any Images, Content Or Doing Anything Yourself
✅Integrated Chat GPT Bot gives Instant Answers on Your Website to Visitors
✅Just Enter the title, and your Content for Pages and Posts will be ready on your website
✅Automatically insert visually appealing images into posts based on keywords and titles.
✅Choose the temperature of the content and control its randomness.
✅Control the length of the content to be generated.
✅Never Worry About Paying Huge Money Monthly To Top Content Creation Platforms
✅100% Easy-to-Use, Newbie-Friendly Technology
✅30-Days Money-Back Guarantee
See My Other Reviews Article:
(1) TubeTrivia AI Review: https://sumonreview.com/tubetrivia-ai-review
(2) SocioWave Review: https://sumonreview.com/sociowave-review
(3) AI Partner & Profit Review: https://sumonreview.com/ai-partner-profit-review
(4) AI Ebook Suite Review: https://sumonreview.com/ai-ebook-suite-review
#AIGenieApp #AIGenieBonus #AIGenieBonuses #AIGenieDemo #AIGenieDownload #AIGenieLegit #AIGenieLiveDemo #AIGenieOTO #AIGeniePreview #AIGenieReview #AIGenieReviewandBonus #AIGenieScamorLegit #AIGenieSoftware #AIGenieUpgrades #AIGenieUpsells #HowDoesAlGenie #HowtoBuyAIGenie #HowtoMakeMoneywithAIGenie #MakeMoneyOnline #MakeMoneywithAIGenie
Graspan: A Big Data System for Big Code AnalysisAftab Hussain
We built a disk-based parallel graph system, Graspan, that uses a novel edge-pair centric computation model to compute dynamic transitive closures on very large program graphs.
We implement context-sensitive pointer/alias and dataflow analyses on Graspan. An evaluation of these analyses on large codebases such as Linux shows that their Graspan implementations scale to millions of lines of code and are much simpler than their original implementations.
These analyses were used to augment the existing checkers; these augmented checkers found 132 new NULL pointer bugs and 1308 unnecessary NULL tests in Linux 4.4.0-rc5, PostgreSQL 8.3.9, and Apache httpd 2.2.18.
- Accepted in ASPLOS ‘17, Xi’an, China.
- Featured in the tutorial, Systemized Program Analyses: A Big Data Perspective on Static Analysis Scalability, ASPLOS ‘17.
- Invited for presentation at SoCal PLS ‘16.
- Invited for poster presentation at PLDI SRC ‘16.
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work the team is investigating ways to speedup the time to solution for many different parts of the DIII-D workflow including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
GraphSummit Paris - The art of the possible with Graph TechnologyNeo4j
Sudhir Hasbe, Chief Product Officer, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
2. About Me
• 'Streams Infrastructure' at LinkedIn
  – Pub-sub messaging: Apache Kafka
  – Change capture from various data systems: Databus
  – Stream processing platform: Apache Samza
• Previous
  – Microsoft Cloud/IoT Messaging (EventHub) and Enterprise Messaging (Queues/Topics)
  – .NET WebServices and Workflow stack
  – BizTalk Server
3. Agenda
• What is Stream Processing?
• Scenarios
• Canonical Architecture
• Essential Ingredients of Stream Processing
• Close
14. Basics: Scaling Ingestion
- Streams are partitioned
- Messages are sent to partitions based on a PartitionKey
- Time-based message retention
[Diagram: producers send messages to the partitions of Stream A by PartitionKey (Pkey=10, Pkey=25, Pkey=45); consumerA instances on machine1 and machine2 each read a subset of the partitions]
e.g. Kafka, AWS Kinesis, Azure EventHub
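To make key-based partitioning concrete, here is a minimal sketch using the Kafka Java producer; the topic name and keys are made up for illustration. Messages sent with the same key hash to the same partition, so a single consumer instance sees them in order.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedProducerSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    // try-with-resources closes the producer, flushing any pending sends.
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // Both records carry key "10" (cf. Pkey=10 in the diagram above), so
      // they are routed to the same partition and consumed in order.
      producer.send(new ProducerRecord<>("stream-a", "10", "view:ad-42"));
      producer.send(new ProducerRecord<>("stream-a", "10", "click:ad-42"));
    }
  }
}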
16. Samza – Streaming Dataflow
[Diagram: a streaming dataflow of Samza jobs connected by streams: Job 1 and Job 2 chained via Streams A, B, C and D]
17. Horizontal Scaling is great! But..
• More machines means more $$
• Need to do more with less
• So what's the key bottleneck during event/stream processing?
18. Key Bottleneck: “Accessing Data”
• Big impact on CPU, network, disk
• Types of data access:
  1. Adjunct data – read-only data
  2. Scratchpad/derived data – read-write data
19. Adjunct Data – typical access
[Diagram: an AdClicks stream from Kafka feeds a processing job, which reads member info from a remote Member Database and writes AdQuality updates back to Kafka]
Concerns
1. Latency
2. CPU
3. Network
4. DDOS
20. Scratchpad/Derived Data – typical access
[Diagram: a sensor-data stream from Kafka feeds a processing job, which reads and updates per-device state in a remote Device State Database and emits alerts to Kafka]
Concerns
1. Latency
2. CPU
3. Network
4. DDOS
21. Adjunct Data – with Samza
[Diagram: the AdClicks stream from Kafka feeds a partitioned processing job (Task1, Task2, Task3); member updates flow from the Member Database (Espresso) through Databus into each task's local store; results go to an output Kafka topic]
Kafka, Databus, the database and the Samza job are all partitioned by MemberId
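A rough sketch, not LinkedIn's actual job, of what this looks like with Samza's low-level API: the task consumes the Databus-derived member-updates stream to keep a partition-local KeyValueStore fresh, then joins ad clicks against that store without any remote call. The store and stream names ("member-store", "member-updates", "ad-quality-updates") are hypothetical.

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class AdQualityTask implements StreamTask, InitableTask {
  private static final SystemStream OUTPUT = new SystemStream("kafka", "ad-quality-updates");
  private KeyValueStore<String, String> memberStore; // partition-local store

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    memberStore = (KeyValueStore<String, String>) context.getStore("member-store");
  }

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                      TaskCoordinator coordinator) {
    String stream = envelope.getSystemStreamPartition().getStream();
    String memberId = (String) envelope.getKey();

    if ("member-updates".equals(stream)) {
      // Change-capture event (Databus): refresh the local copy of member info.
      memberStore.put(memberId, (String) envelope.getMessage());
    } else {
      // Ad-click event: join against local state, no remote DB call.
      String memberInfo = memberStore.get(memberId);
      collector.send(new OutgoingMessageEnvelope(OUTPUT, memberId,
          "click=" + envelope.getMessage() + ", member=" + memberInfo));
    }
  }
}

Because every input (Kafka, Databus, the database and the job) is partitioned by MemberId, each task only ever needs the slice of member data that its own partitions reference.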
22. Fault Tolerance in a stateful Samza job
[Diagram: Tasks 0-3, each with its local state partition (P0-P3), spread across Host-A, Host-B and Host-C; every local write is also recorded in a changelog stream]
Stable state
23. Fault Tolerance in a stateful Samza job
[Diagram: same layout as the previous slide]
Host A dies/fails
24. Fault Tolerance in a stateful Samza job
[Diagram: the tasks from Host-A now placed on Host-E]
YARN allocates the tasks to a container on a different host!
25. Fault Tolerance in a stateful Samza job
[Diagram: same layout, with Host-E rebuilding its local state partitions]
Restore local state by reading from the changelog
26. Fault Tolerance in a stateful Samza job
[Diagram: same layout as before, fully recovered]
Back to stable state
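For reference, a changelog-backed store is declared in a Samza job's .properties configuration roughly as follows (store and topic names here are hypothetical). Every local write is mirrored to a log-compacted Kafka topic, which is what the restore step above replays:

# Hypothetical store configuration; names are illustrative.
stores.member-store.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
stores.member-store.key.serde=string
stores.member-store.msg.serde=string
# Every write to the store is also sent to this log-compacted Kafka topic,
# which is replayed to rebuild local state when a task moves to a new host.
stores.member-store.changelog=kafka.member-store-changelog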
27. Performance Numbers with Samza
Hardware spec: 24 cores, 1 Gig NIC, SSD
• (Baseline) Simple pass-through job with no local state: 1.2 million msg/sec
• Samza job with local state: 400k msg/sec
• Samza job with local state with Kafka backup: 300k msg/sec
28. Local State – Summary
• Great for both read-only data and read-write data
• Secret sauce to make local state work:
  1. Change capture system: Databus / DynamoDB Streams
  2. Durable backup with Kafka log-compacted topics
29. Essential Ingredients to Stream Processing
1. Scale
2. Reprocessing
3. Accuracy of results
4. Easy to program
31. Why do we need it?
• Software upgrades.. yes, bugs are a reality
• Business logic changes
• First-time job deployment
32. Reprocessing Data – with Samza
[Diagram: a Company/Title/Location Standardization job applies a machine-learning model to member updates flowing from the Member Database (Espresso) via Databus, writing results to an output Kafka topic; to reprocess, the job bootstraps from the full stream of member records]
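One way reprocessing is typically triggered in Samza is through stream-level configuration. The sketch below (system and stream names hypothetical) rewinds the input to the oldest offset and marks the change-capture stream as a bootstrap stream, so it is fully consumed before any other input is processed:

# Hypothetical reprocessing configuration; names are illustrative.
# Re-read the member-updates stream from the beginning...
systems.kafka.streams.member-updates.samza.offset.default=oldest
systems.kafka.streams.member-updates.samza.reset.offset=true
# ...and fully consume it before processing other input streams.
systems.kafka.streams.member-updates.samza.bootstrap=true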
33. Reprocessing – Caveats
• Stream processors are fast.. they can DOS the system if you reprocess
  – Control the max concurrency of your job
  – Quotas for Kafka and databases
  – Async load into databases (Project Venice)
• Capacity
  – Reprocessing a 100 TB source?
• Doesn't reprocessing mean you are no longer real-time?
34. Essential Ingredients to Stream Processing
1. Scale, but not at any cost
2. Reprocessing
3. Accuracy of results
4. Easy to program
36. Querying over an infinite stream
[Diagram: User1 generates an Ad View event at 1:00 pm and an Ad Click event at 1:01 pm; the Ad Quality Processor must answer: did the user click the ad within 2 minutes of seeing it?]
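A sketch of one plausible way to answer that query with local state (not the actual LinkedIn implementation; store and stream names are made up): remember each ad view in a local store keyed by member and ad, then check the timestamp difference when the click arrives.

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class AdViewClickJoinTask implements StreamTask, InitableTask {
  private static final long TWO_MINUTES_MS = 2 * 60 * 1000L;
  private static final SystemStream OUTPUT = new SystemStream("kafka", "quality-clicks");
  // Keyed by "memberId:adId"; value = view timestamp in millis.
  private KeyValueStore<String, Long> viewStore;

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    viewStore = (KeyValueStore<String, Long>) context.getStore("ad-view-store");
  }

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                      TaskCoordinator coordinator) {
    String stream = envelope.getSystemStreamPartition().getStream();
    String key = (String) envelope.getKey();                        // "memberId:adId"
    long eventTime = Long.parseLong((String) envelope.getMessage()); // event timestamp

    if ("ad-views".equals(stream)) {
      viewStore.put(key, eventTime);
    } else if ("ad-clicks".equals(stream)) {
      Long viewTime = viewStore.get(key);
      // Late/out-of-order arrivals: the view may not be here yet, in which
      // case a real job would buffer the click rather than drop it.
      if (viewTime != null && eventTime - viewTime <= TWO_MINUTES_MS) {
        collector.send(new OutgoingMessageEnvelope(OUTPUT, key, "quality-click"));
      }
    }
  }
}

The hard part, as the next slides show, is that the view and the click may arrive late or out of order, so a naive lookup like this is only the starting point.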
37. Delays – an example
[Diagram: an AdView event enters through a load balancer to the services tier in Datacenter 1, is written to the local Kafka cluster, and is mirrored to the Kafka cluster in Datacenter 2; an Ad Quality Processor (Samza) runs in each datacenter]
38. Delays – an example
[Diagram: the same two-datacenter setup, now for the matching AdClick event: it enters through a load balancer, is written to the local Kafka cluster and mirrored to the other datacenter, where real-time processing (Samza) runs; so the view and its click may arrive in different datacenters at different times]
39. What do we need to do to get accurate results?
Deal with:
• Late arrivals
  – e.g. the AdClick event showed up 5 minutes late
• Out-of-order arrival
  – e.g. the AdClick event showed up before the AdView event
• Influenced by "Google MillWheel"
41. Myth: This isn't a problem with Lambda Architecture..
• Theory: since the processing happens an hour or several hours later, delays are not a problem
• OK.. but what about the "edges"?
  – Some "sessions" start before the cut-off time for processing.. and end after it
  – Delays and out-of-order processing make things worse at the edges
42. Essential Ingredients to Stream Processing
1. Scale, but not at any cost
2. Reprocessing
3. Accuracy of results
4. Easy programmability
43. Easy Programmability
• Support for "accurate" windowing/joins (Google Cloud Dataflow)
• Ability to express workflows/DAGs in config and a DSL (e.g. Storm)
• SQL support for querying over streams (Azure Stream Insight)
• Apache Samza – working on the above
44. Agenda
• Stream processing Intro
• Scenarios
• Canonical Architecture
• Essential Ingredients of Stream Processing
• Close
45. Some scale numbers at LinkedIn
• 1.3 trillion messages get ingested into Kafka per day
  – Each message gets consumed 4-5 times
• Database change capture: more than 2 trillion messages get consumed per week
• Samza jobs in production process more than 1 million messages/sec
Note: these numbers are not reflective of LinkedIn site traffic