Building data pipelines is pretty hard! Building a multi-datacenter active-active real time data pipeline for multiple classes of data with different durability, latency and availability guarantees is much harder. Real time infrastructure powers critical pieces of Uber (think Surge) and in this talk we will discuss our architecture, technical challenges, learnings and how a blend of open source infrastructure (Apache Kafka and Flink) and in-house technologies have helped Uber scale.
Full recorded presentation at https://www.youtube.com/watch?v=2UfAgCSKPZo for Tetrate Tech Talks on 2022/05/13.
Envoy's support for the Kafka protocol, in the form of broker-filter and mesh-filter.
Contents:
- overview of Kafka (use cases, partitioning, producer/consumer, protocol);
- proxying Kafka (non-Envoy specific);
- proxying Kafka with Envoy;
- handling Kafka protocol in Envoy;
- Kafka-broker-filter for per-connection proxying;
- Kafka-mesh-filter to provide front proxy for multiple Kafka clusters.
References:
- https://adam-kotwasinski.medium.com/deploying-envoy-and-kafka-8aa7513ec0a0
- https://adam-kotwasinski.medium.com/kafka-mesh-filter-in-envoy-a70b3aefcdef
How Zillow Unlocked Kafka to 50 Teams in 8 months | Shahar Cizer Kobrinsky, Z... | HostedbyConfluent
As an AWS shop, Zillow engineering teams have been using various messaging and streaming services for years. As Zillow 2.0 piled through, new requirements and pain points made us rethink our streaming stack. The need for high data quality, decoupling producers & consumers and real time homes data called for a new platform which would empower developers, enable data governance and reduce incidents caused by bad data. In this session, you will learn why Zillow decided to go with Kafka for that platform, what tools we built to meet developers where they are and what common challenges you could face as you migrate other streaming solutions to Kafka.
Extending the Apache Kafka® Replication Protocol Across Clusters, Sanjana Kau... | HostedbyConfluent
Extending the Apache Kafka® Replication Protocol Across Clusters, Sanjana Kaundinya | Current 2022
When Apache Kafka® was first created, one of the hallmarks was its native replication protocol, which provided built-in resiliency in the system. As a business scales, there’s a need to have this fault-tolerance transcend beyond the local data center, and a multi-geographic deployment becomes critical. Traditionally, Kafka Connect based solutions have tried their hand at enabling these types of deployments. However, this presents its own set of operational challenges that can be quite costly.
In this talk, we will go over how you can use the existing replication protocol across clusters. You will learn how to use Cluster Linking to run a multi-region data streaming deployment without the burden and operational overhead of running yet another data system. We will discuss:
* Automation options for creating mirror topics
* Failover processes and caveats to consider
* Handling ACL replication and consumer offset synchronization
* And more!
So, join us on this intergalactic journey to discover how you can use Cluster Linking to decrease your operational overhead, maintain a multi-geographic deployment, and perhaps even reach infinity (and beyond)!
Haitao Zhang, Uber, Software Engineer + Yang Yang, Uber, Senior Software Engineer
Kafka Consumer Proxy is a forwarding proxy that consumes messages from Kafka and dispatches them to a user registered gRPC service endpoint. With Kafka Consumer Proxy, the experience of consuming messages from Apache Kafka for pub-sub use cases is as seamless and user-friendly as receiving (g)RPC requests. In this talk, we will share (1) the motivation for building this service, (2) the high-level architecture, (3) the mechanisms we designed to achieve high availability, scalability, and reliability, and (4) the current adoption status.
https://www.meetup.com/KafkaBayArea/events/273834934/
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features) | Kai Wähner
High level introduction to Confluent REST Proxy and Schema Registry (leveraging Apache Avro under the hood), two components of the Apache Kafka open source ecosystem. See the concepts, architecture and features.
Why My Streaming Job is Slow - Profiling and Optimizing Kafka Streams Apps (L... | confluent
Kafka Streams performance monitoring and tuning is important for many reasons, including identifying bottlenecks, achieving greater throughput, and capacity planning. In this talk we’ll share the techniques we used to achieve greater performance and save on compute, storage, and cost. We’ll cover:
* Identifying design bottlenecks by reviewing logs, metrics, and serdes
* State store access patterns, design, and optimization
* Using profiling tools such as JMX, YourKit, etc.
* Performance tuning of Kafka and Kafka Streams configuration and properties
* JVM optimization for correct heap size and garbage collection strategies
* Functional programming and imperative programming trade-offs
From Mainframe to Microservice: An Introduction to Distributed Systems | Tyler Treat
An introductory overview of distributed systems—what they are and why they're difficult to build. We explore fundamental ideas and practical concepts in distributed programming. What is the CAP theorem? What is distributed consensus? What are CRDTs? We also look at options for solving the split-brain problem while considering the trade-off of high availability as well as options for scaling shared data.
Most data visualisation solutions today still work on data sources which are stored persistently in a data store, using the so-called “data at rest” paradigm. More and more data sources today provide a constant stream of data, from IoT devices to social media streams. These data streams publish at high velocity, and messages often have to be processed as quickly as possible. For processing and analytics on the data, so-called stream processing solutions are available, but these provide minimal or no visualisation capabilities. One way is to first persist the data into a data store and then use a traditional data visualisation solution to present the data.
If latency is not an issue, such a solution might be good enough. Another question is which data store is needed to keep up with the high write and read load. If it is not an RDBMS but a NoSQL database, then not all traditional visualisation tools may integrate with that specific data store yet. Another option is to use a streaming visualisation solution. These are built specifically for streaming data and often do not support batch data. A much better solution would be one tool capable of handling both batch and streaming data. This talk presents different architecture blueprints for integrating data visualisation into a fast data solution and highlights some of the products available to implement these blueprints.
Evening out the uneven: dealing with skew in Flink | Flink Forward
Flink Forward San Francisco 2022.
When running Flink jobs, skew is a common problem that results in wasted resources and limited scalability. In the past years, we have helped our customers and users solve various skew-related issues in their Flink jobs or clusters. In this talk, we will present the different types of skew that users often run into: data skew, key skew, event time skew, state skew, and scheduling skew, and discuss solutions for each of them. We hope this will serve as a guideline to help you reduce skew in your Flink environment.
by Jun Qin & Karl Friedrich
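As a rough illustration of the key skew mentioned in the abstract above, the following sketch measures how unevenly records spread over parallel subtasks when they are routed by key. It is not taken from the talk: `subtask_for`, `skew_ratio`, and the CRC32-based routing are hypothetical stand-ins, and Flink's actual key-to-subtask assignment works differently.

```python
import zlib
from collections import Counter
from typing import Iterable


def subtask_for(key: str, parallelism: int) -> int:
    """Route a key to a subtask deterministically (CRC32 is an assumption,
    used only because it is stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % parallelism


def skew_ratio(keys: Iterable[str], parallelism: int) -> float:
    """Return max load / mean load across subtasks.

    1.0 means a perfectly even spread; larger values mean one subtask
    carries disproportionately more records than the average.
    """
    counts = Counter(subtask_for(k, parallelism) for k in keys)
    loads = [counts.get(i, 0) for i in range(parallelism)]
    mean = sum(loads) / parallelism
    if mean == 0:
        return 0.0  # no records at all, so nothing to compare
    return max(loads) / mean


# One "hot" key dominates the stream, so whichever subtask owns it is overloaded.
keys = ["hot"] * 90 + [f"k{i}" for i in range(10)]
ratio = skew_ratio(keys, parallelism=4)
```

A ratio well above 1.0, as with the hot key here, suggests that adding parallelism alone would not help, because the hot key still lands on a single subtask.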
(Jason Gustafson, Confluent) Kafka Summit SF 2018
Kafka has a well-designed replication protocol, but over the years, we have found some extremely subtle edge cases which can, in the worst case, lead to data loss. We fixed the cases we were aware of in version 0.11.0.0, but shortly after that, another edge case popped up and then another. Clearly we needed a better approach to verify the correctness of the protocol. What we found is Leslie Lamport’s specification language TLA+.
In this talk I will discuss how we have stepped up our testing methodology in Apache Kafka to include formal specification and model checking using TLA+. I will cover the following:
1. How Kafka replication works
2. What weaknesses we have found over the years
3. How these problems have been fixed
4. How we have used TLA+ to verify the fixed protocol.
This talk will give you a deeper understanding of Kafka replication internals and its semantics. The replication protocol is a great case study in the complex behavior of distributed systems. By studying the faults and how they were fixed, you will have more insight into the kinds of problems that may lurk in your own designs. You will also learn a little bit of TLA+ and how it can be used to verify distributed algorithms.
Presented at Stream Processing Meetup (7/19/2018) (https://www.meetup.com/Stream-Processing-Meetup-LinkedIn/events/251481797/).
At Uber, we operate 20+ Kafka clusters to collect system and application logs as well as event data from rider and driver apps. We need a Kafka replication solution to replicate data between Kafka clusters across multiple data centers for different purposes. This talk will introduce the history behind uReplicator and its high-level architecture. As the original uReplicator ran into scalability challenges and operational overhead as the scale of the Kafka clusters increased, we built the Federated uReplicator, which addresses these issues and provides an extensible architecture for further scaling.
Building Real-time Pipelines with FLaNK: A Case Study with Transit Data | Timothy Spann
In this session, we will explore the powerful combination of Apache Flink, Apache NiFi, and Apache Kafka for building real-time data processing pipelines. We will present a case study using the FLaNK-MTA project, which leverages these technologies to process and analyze real-time data from the New York City Metropolitan Transportation Authority (MTA). By integrating Flink, NiFi, and Kafka, FLaNK-MTA demonstrates how to efficiently collect, transform, and analyze high-volume data streams, enabling timely insights and decision-making.
Takeaways:
* Understanding the integration of Apache Flink, Apache NiFi, and Apache Kafka for real-time data processing
* Insights into building scalable and fault-tolerant data processing pipelines
* Best practices for data collection, transformation, and analytics with FLaNK-MTA as a reference
* Knowledge of use cases and potential business impact of real-time data processing pipelines
https://github.com/tspannhw/FLaNK-MTA/tree/main
https://medium.com/@tspann/finding-the-best-way-around-7491c76ca4cb
apache nifi
apache kafka
apache flink
apache iceberg
apache parquet
real-time streaming
tim spann
principal developer advocate
cloudera
datainmotion.dev
Keystone Data Pipeline manages several thousand Flink pipelines with variable workloads. These pipelines are simple routers which consume from Kafka and write to one of three sinks. To reduce our operational overhead, we’ve implemented autoscaling for our routers. Autoscaling has reduced our resource usage by 25% to 45% (varying by region and time) and has reduced our on-call burden. This talk will take an in-depth look at the mathematics, algorithms, and infrastructure details for implementing autoscaling of simple pipelines at scale. It will also discuss future work on autoscaling complex pipelines.
Building Cloud-Native App Series - Part 2 of 11
Microservices Architecture Series
Event Sourcing & CQRS,
Kafka, Rabbit MQ
Case Studies (E-Commerce App, Movie Streaming, Ticket Booking, Restaurant, Hospital Management)
Kafka Streams is a new stream processing library natively integrated with Kafka. It has a very low barrier to entry, easy operationalization, and a natural DSL for writing stream processing applications. As such it is the most convenient yet scalable option to analyze, transform, or otherwise process data that is backed by Kafka. We will provide the audience with an overview of Kafka Streams including its design and API, typical use cases, code examples, and an outlook of its upcoming roadmap. We will also compare Kafka Streams' light-weight library approach with heavier, framework-based tools such as Spark Streaming or Storm, which require you to understand and operate a whole different infrastructure for processing real-time data in Kafka.
At Instagram, our mission is to capture and share the world's moments. Our app is used by over 400M people monthly; this creates a lot of challenging data needs. We use Cassandra heavily as a general-purpose key-value store. In this presentation, I will talk about how we use Cassandra to serve our critical use cases; the improvements and patches we made to make sure Cassandra can meet our low-latency, high-scalability requirements; and some pain points we have.
About the Speaker
Dikang Gu Software Engineer, Facebook
I'm a software engineer on the Instagram core infra team, working on scaling Instagram infrastructure, especially on building a generic key-value store based on Cassandra. Prior to this, I worked on the development of HDFS at Facebook. I received my master's degree in Computer Science from Shanghai Jiao Tong University in China.
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka | Kai Wähner
Streaming all over the World: Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka.
Learn about various case studies for event streaming with Apache Kafka across industries. The talk explores architectures for real-world deployments from Audi, BMW, Disney, Generali, Paypal, Tesla, Unity, Walmart, William Hill, and more. Use cases include fraud detection, mainframe offloading, predictive maintenance, cybersecurity, edge computing, track&trace, live betting, and much more.
Building a Real-Time Analytics Application with Apache Pulsar and Apache Pinot | Altinity Ltd
While the demand for real-time analytics is growing by leaps and bounds, analytics software must rely on streaming platforms to ingest high volumes of data traveling at lightning speed down the pipeline. We will take a look at two powerful open source Apache platforms, Pulsar and Pinot, that work hand in hand to deliver analytical results which bring great value to your systems.
Presenters: Mary Grygleski - Streaming Developer Advocate &
Mark Needham - Developer Relations Engineer at StarTree
Note: This webinar will be recorded and later posted on our Webinar page (https://altinity.com/webinarspage/) or Altinity official Youtube channel (https://www.youtube.com/@Altinity).
Hadoop summit - Scaling Uber’s Real-Time Infra for Trillion Events per Day | Ankur Bansal
Building data pipelines is pretty hard! Building a multi-datacenter active-active real time data pipeline for multiple classes of data with different durability, latency and availability guarantees is much harder.
Real time infrastructure powers critical pieces of Uber (think Surge) and in this talk we will discuss our architecture, technical challenges, learnings and how a blend of open source infrastructure (Apache Kafka and Samza) and in-house technologies have helped Uber scale.
The Zen of High Performance Messaging with NATS | NATS
Waldemar Quevedo Salinas, Senior Software Engineer
NATS is an open source, high-performance messaging system designed to be as simple and reliable as possible without trading off scalability. Originally written in Ruby and then rewritten in Go, a NATS server can nowadays push over 11M messages per second.
In this talk, we will cover how following simplicity as the main design constraint, as well as focusing on a limited built-in feature set, resulted in a system which is easy to operate and reason about, making it an attractive choice when building many types of distributed systems where low latency and high availability are very important.
You can learn more about NATS at http://www.nats.io
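Understanding Metrics for Apache Kafka Monitoring and Approaches to Optimization | SANG WON PARK
As Apache Kafka's role in big data architectures grows and it takes on greater importance, concerns about its performance are growing as well.
While working on various projects, I put together an understanding of the metrics needed to monitor Apache Kafka and the configuration settings used to optimize them.
[Understanding Metrics for Apache Kafka Monitoring and Approaches to Optimization]
Covers the metrics needed for Apache Kafka performance monitoring and summarizes ways to optimize performance from four perspectives (throughput, latency, durability, availability). For each of the three modules that make up Kafka (Producer, Broker, Consumer), for performance optimization …
[Understanding Metrics for Apache Kafka Monitoring]
To monitor the state of Apache Kafka, you need to look at the metrics produced in four places (System (OS), Producer, Broker, Consumer).
This post summarizes producer/broker/consumer indicators, focusing on the JMX metrics provided by the JVM.
It does not cover every metric; it reflects my understanding of the metrics that are meaningful from my perspective.
[Optimizing Apache Kafka Performance Configuration]
Performance goals are divided into four categories (Throughput, Latency, Durability, Availability), and for each goal I summarize which Kafka configuration settings to adjust and how.
After applying the tuned parameters, you need to run performance tests and monitor the resulting metrics, tuning further until the setup is optimized for your current workload.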
How Uber scaled its Real Time Infrastructure to Trillion events per day | DataWorks Summit
Building data pipelines is pretty hard! Building a multi-datacenter active-active real time data pipeline for multiple classes of data with different durability, latency and availability guarantees is much harder.
Real time infrastructure powers critical pieces of Uber (think Surge) and in this talk we will discuss our architecture, technical challenges, learnings and how a blend of open source infrastructure (Apache Kafka and Samza) and in-house technologies have helped Uber scale.
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ... | HostedbyConfluent
Apache Kafka is used as the primary message bus for propagating events and logs across Uber. In particular, it pairs with Apache Pinot, a real-time distributed OLAP datastore, to deliver real-time insights seconds after the messages are produced to Kafka.
One challenge we faced was to update existing data in Pinot with the changelog in Kafka, and deliver an accurate view in the real-time analytical results. For example, the financial dashboard can report gross booking with the corrected Ride fares. And restaurant owners can analyze the UberEats orders with their latest delivery status.
Implementing upserts in an immutable real-time OLAP store like Pinot is nontrivial. We need to make architectural changes in how data is distributed via Kafka amongst the server nodes, how it's indexed and queried in a distributed fashion. In this talk I will discuss how we leveraged Kafka's partition-by-key feature to this end and how we added this ability in Pinot without any performance degradation.
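To make the partition-by-key idea above more concrete, here is a minimal sketch of why routing a changelog by key helps with upserts: every record for a given key lands in the same partition, so a single owner can keep the latest version. This is not Pinot's implementation. `partition_for`, `apply_changelog`, and `NUM_PARTITIONS` are hypothetical names, and CRC32 stands in for the murmur2 hash used by Kafka's default partitioner for keyed records.

```python
import zlib
from typing import Dict, List, Tuple

NUM_PARTITIONS = 4  # hypothetical partition count, for illustration only


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition deterministically.

    CRC32 is an assumption chosen for a stable, self-contained example;
    it is not the hash Kafka uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def apply_changelog(events: List[Tuple[str, dict]]) -> Dict[int, Dict[str, dict]]:
    """Route each (key, record) event to its partition and keep only the
    latest record per key within that partition."""
    tables: Dict[int, Dict[str, dict]] = {p: {} for p in range(NUM_PARTITIONS)}
    for key, record in events:
        # Same key -> same partition, so a later event overwrites the earlier one.
        tables[partition_for(key)][key] = record
    return tables


events = [
    ("order-1", {"status": "placed"}),
    ("order-2", {"status": "placed"}),
    ("order-1", {"status": "delivered"}),  # update to an existing key
]
tables = apply_changelog(events)
latest = tables[partition_for("order-1")]["order-1"]
```

Because all versions of `order-1` arrive at one partition in order, the owner of that partition can resolve the latest state locally without coordinating with other partitions.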
From Mainframe to Microservice: An Introduction to Distributed SystemsTyler Treat
An introductory overview of distributed systems—what they are and why they're difficult to build. We explore fundamental ideas and practical concepts in distributed programming. What is the CAP theorem? What is distributed consensus? What are CRDTs? We also look at options for solving the split-brain problem while considering the trade-off of high availability as well as options for scaling shared data.
Most data visualisation solutions today still work on data sources which are stored persistently in a data store, using the so called “data at rest” paradigms. More and more data sources today provide a constant stream of data, from IoT devices to Social Media streams. These data stream publish with high velocity and messages often have to be processed as quick as possible. For the processing and analytics on the data, so called stream processing solutions are available. But these only provide minimal or no visualisation capabilities. One was is to first persist the data into a data store and then use a traditional data visualisation solution to present the data.
If latency is not an issue, such a solution might be good enough. An other question is which data store solution is necessary to keep up with the high load on write and read. If it is not an RDBMS but an NoSQL database, then not all traditional visualisation tools might already integrate with the specific data store. An other option is to use a Streaming Visualisation solution. They are specially built for streaming data and often do not support batch data. A much better solution would be to have one tool capable of handling both, batch and streaming data. This talk presents different architecture blueprints for integrating data visualisation into a fast data solution and highlights some of the products available to implement these blueprints.
Evening out the uneven: dealing with skew in FlinkFlink Forward
Flink Forward San Francisco 2022.
When running Flink jobs, skew is a common problem that results in wasted resources and limited scalability. In the past years, we have helped our customers and users solve various skew-related issues in their Flink jobs or clusters. In this talk, we will present the different types of skew that users often run into: data skew, key skew, event time skew, state skew, and scheduling skew, and discuss solutions for each of them. We hope this will serve as a guideline to help you reduce skew in your Flink environment.
by
Jun Qin & Karl Friedrich
(Jason Gustafson, Confluent) Kafka Summit SF 2018
Kafka has a well-designed replication protocol, but over the years, we have found some extremely subtle edge cases which can, in the worst case, lead to data loss. We fixed the cases we were aware of in version 0.11.0.0, but shortly after that, another edge case popped up and then another. Clearly we needed a better approach to verify the correctness of the protocol. What we found is Leslie Lamport’s specification language TLA+.
In this talk I will discuss how we have stepped up our testing methodology in Apache Kafka to include formal specification and model checking using TLA+. I will cover the following:
1. How Kafka replication works
2. What weaknesses we have found over the years
3. How these problems have been fixed
4. How we have used TLA+ to verify the fixed protocol.
This talk will give you a deeper understanding of Kafka replication internals and its semantics. The replication protocol is a great case study in the complex behavior of distributed systems. By studying the faults and how they were fixed, you will have more insight into the kinds of problems that may lurk in your own designs. You will also learn a little bit of TLA+ and how it can be used to verify distributed algorithms.
Presented at Stream Processing Meetup (7/19/2018)(https://www.meetup.com/Stream-Processing-Meetup-LinkedIn/events/251481797/).
At Uber, we operate 20+ Kafka clusters to collect system and application logs as well as event data from rider and driver apps. We need a Kafka replication solution to replicate data between Kafka clusters across multiple data centers for different purposes. This talk will introduce the history behind uReplicator and the high level architecture. As the original uReplicator ran into scalability challenges and operational overhead as the scale of Kafka clusters increased, we built the Federated uReplicator which addressed above issues and provide an extensible architecture for further scaling.
Building Real-time Pipelines with FLaNK_ A Case Study with Transit DataTimothy Spann
Building Real-time Pipelines with FLaNK_ A Case Study with Transit Data
Building Real-time Pipelines with FLaNK: A Case Study with Transit Data
In this session, we will explore the powerful combination of Apache Flink, Apache NiFi, and Apache Kafka for building real-time data processing pipelines. We will present a case study using the FLaNK-MTA project, which leverages these technologies to process and analyze real-time data from the New York City Metropolitan Transportation Authority (MTA). By integrating Flink, NiFi, and Kafka, FLaNK-MTA demonstrates how to efficiently collect, transform, and analyze high-volume data streams, enabling timely insights and decision-making.
Takeaways:
Understanding the integration of Apache Flink, Apache NiFi, and Apache Kafka for real-time data processing
Insights into building scalable and fault-tolerant data processing pipelines
Best practices for data collection, transformation, and analytics with FLaNK-MTA as a reference
Knowledge of use cases and potential business impact of real-time data processing pipelines
https://github.com/tspannhw/FLaNK-MTA/tree/main
https://medium.com/@tspann/finding-the-best-way-around-7491c76ca4cb
apache nifi
apache kafka
apache flink
apache iceberg
apache parquet
real-time streaming
tim spann
principal developer advocate
cloudera
datainmotion.dev
Keystone Data Pipeline manages several thousand Flink pipelines, with variable workloads. These pipelines are simple routers which consume from Kafka and write to one of three sinks. In order to alleviate our operational overhead, we’ve implemented autoscaling for our routers. Autoscaling has reduced our resource usage by 25% - 45% (varying by region and time), and has reduced our on call burden. This talk will take an in depth look at the mathematics, algorithms, and infrastructure details for implementing autoscaling of simple pipelines at scale. It will also discuss future work for autoscaling complex pipelines.
Building Cloud-Native App Series - Part 2 of 11
Microservices Architecture Series
Event Sourcing & CQRS,
Kafka, Rabbit MQ
Case Studies (E-Commerce App, Movie Streaming, Ticket Booking, Restaurant, Hospital Management)
Kafka Streams is a new stream processing library natively integrated with Kafka. It has a very low barrier to entry, easy operationalization, and a natural DSL for writing stream processing applications. As such it is the most convenient yet scalable option to analyze, transform, or otherwise process data that is backed by Kafka. We will provide the audience with an overview of Kafka Streams including its design and API, typical use cases, code examples, and an outlook of its upcoming roadmap. We will also compare Kafka Streams' light-weight library approach with heavier, framework-based tools such as Spark Streaming or Storm, which require you to understand and operate a whole different infrastructure for processing real-time data in Kafka.
At Instagram, our mission is to capture and share the world's moments. Our app is used by over 400M people monthly; this creates a lot of challenging data needs. We use Cassandra heavily, as a general key-value storage. In this presentation, I will talk about how we use Cassandra to serve our critical use cases; the improvements/patches we made to make sure Cassandra can meet our low latency, high scalability requirements; and some pain points we have.
About the Speaker
Dikang Gu Software Engineer, Facebook
I'm a software engineer at Instagram core infra team, working on scaling Instagram infrastructure, especially on building a generic key-value store based on Cassandra. Prior to this, I worked on the development of HDFS in Facebook. I got the master degree of Computer Science in Shanghai Jiao Tong university in China.
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaKai Wähner
Streaming all over the World: Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka.
Learn about various case studies for event streaming with Apache Kafka across industries. The talk explores architectures for real-world deployments from Audi, BMW, Disney, Generali, Paypal, Tesla, Unity, Walmart, William Hill, and more. Use cases include fraud detection, mainframe offloading, predictive maintenance, cybersecurity, edge computing, track&trace, live betting, and much more.
Building a Real-Time Analytics Application with Apache Pulsar and Apache PinotAltinity Ltd
While the demands for real-time analytics are growing in leaps and bounds, the analytics software must rely on streaming platforms for ingesting high volumes of data that's traveling in lightning speed down the pipeline. We will take a look at 2 powerful open source Apache platforms: Pulsar and Pinot, that work hand-in-hand together to deliver the analytical results which bring great value to your systems.
Presenters: Mary Grygleski - Streaming Developer Advocate &
Mark Needham - Developer Relations Engineer at StarTree
Note: This webinar will be recorded and later posted on our Webinar page (https://altinity.com/webinarspage/) or Altinity official Youtube channel (https://www.youtube.com/@Altinity).
Hadoop summit - Scaling Uber’s Real-Time Infra for Trillion Events per DayAnkur Bansal
Building data pipelines is pretty hard! Building a multi-datacenter active-active real time data pipeline for multiple classes of data with different durability, latency and availability guarantees is much harder.
Real time infrastructure powers critical pieces of Uber (think Surge) and in this talk we will discuss our architecture, technical challenges, learnings and how a blend of open source infrastructure (Apache Kafka and Samza) and in-house technologies have helped Uber scale.
The Zen of High Performance Messaging with NATS NATS
Waldemar Quevedo Salinas, Senior Software Engineer
NATS is an open source, high performant messaging system with a design oriented towards both being as simple and reliable as possible without at the same time trading off scalability. Originally written in Ruby, and then rewritten in Go, a NATS server can nowadays push over 11M messages per second.
In this talk, we will cover how following simplicity as the main design constraint as well as focusing on a limited built-in feature set, resulted in a system which is easy to operate and reason about, making up for an attractive choice for when building many types of distributed systems where low latency and high availability are very important.
You can learn more about NATS at http://www.nats.io
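Understanding and Optimizing Metrics for Apache Kafka Monitoring SANG WON PARK
As Apache Kafka's role in big data architectures grows and it takes on greater importance, concerns about its performance are growing too.
Over the course of various projects, I worked to understand the metrics needed to monitor Apache Kafka and compiled the configuration settings for optimizing them.
[Understanding and Optimizing Metrics for Apache Kafka Monitoring]
Understand the metrics needed for Apache Kafka performance monitoring, and summarize ways to optimize performance from four perspectives (throughput, latency, durability, availability). For each of the three modules that make up Kafka (Producer, Broker, Consumer), for performance optimization …
[Understanding Metrics for Apache Kafka Monitoring]
To monitor the state of Apache Kafka, you need to look at the metrics produced in four places (System (OS), Producer, Broker, Consumer).
This article summarizes the producer/broker/consumer indicators, focusing on the JMX metrics provided by the JVM.
It does not cover every indicator; it reflects my understanding, centered on the indicators I consider meaningful.
[Optimizing Apache Kafka Performance Configuration]
Performance goals are divided into four (Throughput, Latency, Durability, Availability), and for each goal I summarize which Kafka configuration should be adjusted and how.
After applying the tuned parameters, you need to run performance tests, monitor the resulting metrics, and keep tuning until the setup is optimized for your current workload.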
How Uber scaled its Real Time Infrastructure to Trillion events per dayDataWorks Summit
Building data pipelines is pretty hard! Building a multi-datacenter active-active real time data pipeline for multiple classes of data with different durability, latency and availability guarantees is much harder.
Real time infrastructure powers critical pieces of Uber (think Surge) and in this talk we will discuss our architecture, technical challenges, learnings and how a blend of open source infrastructure (Apache Kafka and Samza) and in-house technologies have helped Uber scale.
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...HostedbyConfluent
Apache Kafka is used as the primary message bus for propagating events and logs across Uber. In particular, it pairs with Apache Pinot, a real-time distributed OLAP datastore, to deliver real-time insights seconds after the messages produced to Kafka.
One challenge we faced was to update existing data in Pinot with the changelog in Kafka, and deliver an accurate view in the real-time analytical results. For example, the financial dashboard can report gross booking with the corrected Ride fares. And restaurant owners can analyze the UberEats orders with their latest delivery status.
Implementing upserts in an immutable real-time OLAP store like Pinot is nontrivial. We need to make architectural changes in how data is distributed via Kafka amongst the server nodes, how it's indexed and queried in a distributed fashion. In this talk I will discuss how we leveraged Kafka's partition-by-key feature to this end and how we added this ability in Pinot without any performance degradation.
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016Monal Daxini
Keystone processes over 700 billion events per day (1 peta byte) with at-least once processing semantics in the cloud. We will explore in detail how we leverage Kafka, Samza, Docker, and Linux at scale to implement a multi-tenant pipeline in AWS cloud within a year. We will also share our plans on offering a Stream Processing as a Service for all of Netflix use.
Twitter’s Apache Kafka Adoption Journey | Ming Liu, TwitterHostedbyConfluent
Until recently, the Messaging team at Twitter had been running an in-house built Pub/Sub system, namely EventBus (built on top of Apache DistributedLog and Apache Bookkeeper, and similar in architecture to Apache Pulsar) to cater to our pubsub needs. In 2018, we made the decision to move to Apache Kafka by migrating existing use cases as well as onboarding new use cases directly onto Apache Kafka. Fast forward to today, Kafka is now an essential piece of Twitter Infrastructure and processes over 200M messages per second. In this talk, we will share the learnings and challenges in our journey moving to Apache Kafka.
Data Con LA 2019 - Unifying streaming and message queue with Apache Kafka by ...Data Con LA
In distributed systems, retries are inevitable. From network errors to replication issues and even outages in downstream dependencies, services operating at a massive scale must be prepared to encounter, identify, and handle failure as gracefully as possible. Given the scope and pace at which Uber operates, our systems must be fault-tolerant and uncompromising when it comes to failing intelligently. In particular, in streaming processing and event driven architecture, supporting reliable redeliveries with dead letter queues is a popular ask from many real-time applications and services at Uber. To accomplish this, we leverage Apache Kafka, a popular open source distributed pub/sub messaging platform, which has been industry-tested for delivering high performance at scale. We build competing consumption semantics with dead letter queues on top of existing Kafka APIs and provide interfaces to ack or nack out of order messages with retries and in-process fanout features. We will also talk a bit about several use cases around that such as driver/rider matching, driver incentive payment etc.
The need for gleaning answers from data in real-time is moving from nicety to a necessity. There are few options to analyze the never-ending stream of unbounded data at scale. Let’s compare and contrast the core principles and technologies the different open source solutions available to help with this endeavor, and where in the future processing engines need to evolve to solve processing needs at scale. These findings are based on the experience of continuing to build a scalable solution in the cloud to process over 700 billion events at Netflix, and how we are embarking on the next journey to evolve unbounded data processing engines.
PortoTechHub - Hail Hydrate! From Stream to Lake with Apache Pulsar and FriendsTimothy Spann
https://portotechhub.com/conference-2021/
Timothy Spann
Developer Advocate
StreamNative
A cloud data lake that is empty is not useful to anyone.
How can you quickly, scalably and reliably fill your cloud data lake with diverse sources of data you already have and new ones you never imagined you needed. Utilizing open source tools from Apache, the FLiP stack enables any data engineer, programmer or analyst to build reusable modules with low or no code. FLiP utilizes Apache NiFi, Apache Pulsar, Apache Flink and MiNiFi agents to load CDC, Logs, REST, XML, Images, PDFs, Documents, Text, semistructured data, unstructured data, structured data and a hundred data sources you could never dream of streaming before.
I will teach you how to fish in the deep end of the lake and return a data engineering hero. Let's hope everyone is ready to go from 0 to Petabyte hero.
TRACK RIBEIRA Fri 07:00 — 50 min
19-Nov-2021
Disaster Recovery for Multi-Region Apache Kafka Ecosystems at Uberconfluent
Speaker: Yupeng Fu, Staff Engineer, Uber
High availability and reliability are important requirements for Uber services, and the services must tolerate datacenter failures in a region and fail over to another region. In this talk, we will present active-active Apache Kafka® at Uber and how it facilitates disaster recovery across regions for Uber services. In particular, we will highlight the key components, including topic replication, topic aggregation, and offset sync, and then walk through several use cases of their disaster recovery strategy using active-active Kafka. Lastly, we will present several interesting challenges and the future work planned.
Yupeng Fu is a staff engineer in Uber Data Org leading the streaming data platform. Previously, he worked at Alluxio and Palantir, building distributed data analysis and storage platforms. Yupeng holds a B.S. and an M.S. from Tsinghua University and did his Ph.D. research on databases at UCSD.
Scaling up uber's real time data analyticsXiang Fu
Realtime infrastructure powers critical pieces of Uber. This talk will discuss the architecture, technical challenges, learnings and how a blend of open source infrastructure (Apache Kafka/Flink/Pinot) and in-house technologies have helped Uber scale and enabled SQL to power realtime decision making for city ops, data scientists, data analysts and engineers.
Confluent Operator as Cloud-Native Kafka Operator for KubernetesKai Wähner
Agenda:
- Cloud Native vs. SaaS / Serverless Kafka
- The Emergence of Kubernetes
- Kafka on K8s Deployment Challenges
- Confluent Operator as Kafka Operator
- Q&A
Confluent Operator enables you to:
Provisioning, management and operations of Confluent Platform (including ZooKeeper, Apache Kafka, Kafka Connect, KSQL, Schema Registry, REST Proxy, Control Center)
Deployment on any Kubernetes Platform (Vanilla K8s, OpenShift, Rancher, Mesosphere, Cloud Foundry, Amazon EKS, Azure AKS, Google GKE, etc.)
Automate provisioning of Kafka pods in minutes
Monitor SLAs through Confluent Control Center or Prometheus
Scale Kafka elastically, handle fail-over & Automate rolling updates
Automate security configuration
Built on our first hand knowledge of running Confluent at scale
Fully supported for production usage
What's new in confluent platform 5.4 online talkconfluent
To stay informed about the latest features in Confluent Platform 5.4 join Martijn Kieboom Solutions Engineer at Confluent, for the ‘What’s New in Confluent 5.4?’ on February 12 at 11 am GMT/ 12 Noon CET. Martijn will talk through the new features including:
Role-Based Access Control and how it enables highly granular control of permissions and platform access
Structured Audit Logs and how they enable the capture of authorization logs
How Multi-Region Clusters deliver asynchronous replication at the topic level, allowing companies to run a single Kafka Cluster across multiple data-centres
Schema validations role in enabling businesses that run Kafka at scale to deliver data compatibility across platforms
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020Mariano Gonzalez
Modernizing analytics data pipelines to gain the most of your data while optimizing costs can be challenging. However, today cloud providers offer a good set of services that can help with this endeavor. We will do a tour across some GCP services during this hands-on session, using DataFlow (apache beam) as the backbone to architect a modern analytics pipeline to wire them all together.
Build real time stream processing applications using Apache KafkaHotstar
This talk was presented at the Hotstar Scale Meetup in Bangalore by Jayesh Sidhwani
In this talk, the presenter introduces Apache Kafka and the Apache Kafka Streams library. Starting from the need for building streaming applications to thinking the use-cases as a streaming job - this talk covers all the technicalities.
It ends with a short description of how Kafka is deployed and used at Hotstar
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Monal Daxini
Keystone - Processing over Half a Trillion events per day with 8 million events & 17 GB per second peaks, and at-least once processing semantics. We will explore in detail how we employ Kafka, Samza, and Docker at scale to implement a multi-tenant pipeline. We will also look at the evolution to its current state and where the pipeline is headed next in offering a self-service stream processing infrastructure atop the Kafka based pipeline and support Spark Streaming.
Similar to Kafka Practices @ Uber - Seattle Apache Kafka meetup (20)
18. Producer Libraries
● High Throughput (average case)
○ Non-blocking, async, batched
● At-least-once (critical use case)
○ Blocking, sync
● Topic Discovery
○ Discovers the Kafka cluster a topic belongs to
○ Able to multiplex to different kafka clusters
20. Kafka Local Agent
● Producer side persistence
○ Local storage
● Isolates clients from downstream outages, backpressure
● Controlled backfill upon recovery
○ Prevents from overwhelming a recovering cluster
24. Kafka Rest Proxy: Internals
● Based on Confluent’s open sourced Rest Proxy
● Performance enhancements
○ Simple HTTP servlets on jetty instead of Jersey
○ Optimized for binary payloads.
○ Performance increase from 7K* to 45K QPS/box
● Caching of topic metadata
● Reliability improvements*
○ Support for Fallback cluster
○ Support for multiple producers (SLA-based segregation)
● Plan to contribute back to community
*Based on benchmarking & analysis done in Jun ’2015
26. Kafka Secondary Cluster
● High availability on regional cluster failure
● REST proxy produces to the secondary cluster on regional cluster failure
● uReplicator/MirrorMaker backfills data to the regional cluster on recovery
31. At-Least-Once
[Diagram: Application Process (ProxyClient), Kafka Proxy Server, Regional Kafka, uReplicator, Aggregate Kafka; steps numbered 1 to 8]
● Most of infrastructure tuned for high throughput
○ Batching at each stage
○ Ack before being persisted (ack’ed != committed)
● Single node failure in any stage leads to data loss
● Need a reliable pipeline for High Value Data e.g. Payments
32. At-least-once Kafka: Data Flow
[Diagram: Application Process (ProxyClient), Kafka Proxy Server, Regional Kafka, uReplicator, Aggregate Kafka; steps numbered 1 to 8]
35. Offset Sync Service
● Used for syncing offsets between aggregate clusters on failover
● MirrorMaker periodically snapshots the regional-offset-to-aggregate-offset map to an external datastore
● Uses the offset map to recover a safe consumer offset to resume from in the passive DC
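As a rough illustration of the last bullet, the lookup can be sketched as a search over checkpoints. This is a minimal sketch under simplifying assumptions, not Uber's implementation: the `Checkpoint` shape and the `safe_resume_offset` function are hypothetical, and each checkpoint is assumed to pair the aggregate offsets in the active and passive DCs that correspond to the same regional offset for one partition.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Hypothetical checkpoint: (offset in active aggregate cluster,
#                           offset in passive aggregate cluster).
# Both offsets are assumed to correspond to the same regional offset.
Checkpoint = Tuple[int, int]


def safe_resume_offset(
    checkpoints: List[Checkpoint], committed_active: int
) -> Optional[int]:
    """Return an offset in the passive cluster that is safe to resume from.

    Picks the latest checkpoint at or before the consumer's committed
    offset in the active cluster, so the consumer may re-read some
    messages after failover but does not skip any.
    `checkpoints` must be sorted by the active-cluster offset.
    """
    active_offsets = [active for active, _ in checkpoints]
    idx = bisect_right(active_offsets, committed_active) - 1
    if idx < 0:
        return None  # no checkpoint old enough; caller must pick a fallback
    return checkpoints[idx][1]


if __name__ == "__main__":
    snapshots = [(100, 90), (200, 195), (300, 310)]
    print(safe_resume_offset(snapshots, 250))
```

Rounding down to an earlier checkpoint trades duplicate delivery for no data loss, which matches the "safe offset" wording on the slide.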
39. Chaperone - End to End Auditing
● In-house Auditing Solution for Kafka
● Running in Production for ~2 Years
○ Audit 20k+ topics for 99.99% completeness
● Open Sourced: https://github.com/uber/chaperone
● Uber Engineering Blog: https://eng.uber.com/chaperone/
[George]
Uber as a product is the realtime movement of people and things.
As a result, Kafka (stream processing) is a critical component of many real-time systems at Uber.
[George]
Rider app sends information to our servers, which is fed to Kafka.
Driver app sends information to servers, which is fed to Kafka.
This info is passed to stream processing framework, which does useful calculations.
Then info is passed back to the user in the form of:
Match
Routing info
ETA
Promote Uber eats....
ETAs change based on timings. Need historical input on all trips i.e. submission time, preparation time, pickup time etc... More complex than rider app because there is an offline component.
[George]
Of course, this is just the tip of a very large iceberg
[George]
General pub sub between services
Kafka is the basis of all Stream Processing systems at Uber. AthenaX (our self-serve platform) is built on top of Kafka. AthenaX uses Samza / Flink
All data that needs to be ingested is written to Kafka.
Changelog transport. Slightly different from the above use-cases because of ordering & durability guarantees
Logging is used to feed ELK
[George]
We are one of the largest users of Kafka.
[George]
Excluding replication
[George]
[George]
[George]
Kafka is the hub in Uber’s data infrastructure.
On the left side, we can find many kinds of applications and services. They generate data or logs and send them to Kafka.
At the other side, we have stream processing engine, batch processing engines & various services to process the data.
Now, let’s look a bit deeper in the Kafka box
Highlight surge as an important use case to maintain marketplace health?
For example, Surge adjusts prices based on demand/supply statistics, which are derived from data generated by the rider and driver apps.
ELK indexes log messages for troubleshooting.
Samza and Flink are general stream processing engines, used to find insights in the data in real time,
while Hadoop represents the set of tools that process the data in batches.
Meanwhile, data in Kafka are copied to HDFS and S3 for long term backup.
[George]
[George]
[George]
We are not using a single giant Kafka cluster per datacenter,
since Kafka itself does not have good support for multi-tenancy and resource isolation.
Instead, we have set up multiple clusters to support specific use cases. For example,
we have a dedicated cluster for Surge, which is super critical for Uber’s business.
And we have a cluster for logging topics, which need very high throughput.
Besides, we have a secondary cluster in each data center,
which accepts data from the REST proxy if the primary Kafka cluster goes down.
[George]
This is a high level overview of the Kafka architecture at Uber.
Multiple DC
Producer -> Rest Proxy -> DC Local Regional Cluster -> Mirrormaker/Ureplicator -> Agg Cluster (Global view of data)
[George]
Next half of presentation will cover some of the components we’ve added to scale Kafka at Uber:
Producer Library/Local Agent [Mingmin]
Rest Proxy [Mingmin]
Secondary [Mingmin]
Ureplicator [Mingmin]
OffsetSyncService [George]
Transition: Mingmin will discuss the producer side components.
[Mingmin]
Essentially, client libraries are HTTP clients.
But we use many techniques inside to achieve high throughput and low produce latency,
like non-blocking/async calls and batching.
Produce latency is how long it takes for a produce() call to return.
End-to-end latency is how long it takes for consumers to see the data.
As mentioned, we have multiple Kafka clusters.
The client library needs to discover which cluster a topic belongs to and send messages there.
What’s more, the client library integrates with LocalAgent to ensure data reliability.
We’re going to talk about this in the following section.
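To make the topic discovery step concrete, here is a minimal sketch of routing batched messages to the cluster that owns each topic. The `TopicRouter` class, the `group_by_cluster` helper, and the cluster names are hypothetical illustrations; the talk does not describe how the real client library stores or fetches this mapping.

```python
from typing import Dict, List, Tuple

Message = Tuple[str, bytes]  # (topic, payload)


class TopicRouter:
    """Hypothetical stand-in for topic discovery: maps a topic to a cluster."""

    def __init__(self, topic_to_cluster: Dict[str, str], default_cluster: str) -> None:
        self._topic_to_cluster = dict(topic_to_cluster)
        self._default = default_cluster

    def cluster_for(self, topic: str) -> str:
        # Topics without an explicit mapping go to the default cluster.
        return self._topic_to_cluster.get(topic, self._default)


def group_by_cluster(router: TopicRouter, messages: List[Message]) -> Dict[str, List[Message]]:
    """Split one outgoing batch into per-cluster batches, preserving order."""
    batches: Dict[str, List[Message]] = {}
    for topic, payload in messages:
        batches.setdefault(router.cluster_for(topic), []).append((topic, payload))
    return batches


if __name__ == "__main__":
    router = TopicRouter({"surge-events": "surge", "app-logs": "logging"}, "regional")
    print(group_by_cluster(router, [("surge-events", b"a"), ("app-logs", b"b")]))
```

Grouping before sending keeps batching effective even when one producer writes to topics that live on different clusters.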
[Mingmin]
[Mingmin]
LocalAgent is deployed on every host. It has come in handy in production on several occasions. It’s been designed to use minimal resources, so that it won’t affect services on that host.
When the REST proxy (RP) fails, data from the client fails over to LocalAgent, which keeps the data until RP comes back.
And when RP is back, the backfill rate is controlled to avoid overloading RP.
Data is stored on disk using the Kafka ‘Log’.
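The controlled backfill can be pictured with a small sketch. It is an illustration only: the `LocalBuffer` class and its `max_per_tick` cap are hypothetical, and it buffers in memory for brevity, whereas the notes say the real LocalAgent persists data on disk.

```python
from collections import deque
from typing import Callable, Deque


class LocalBuffer:
    """Hypothetical in-memory stand-in for LocalAgent's local storage."""

    def __init__(self, max_per_tick: int) -> None:
        self._pending: Deque[bytes] = deque()
        self._max_per_tick = max_per_tick  # cap on messages replayed per tick

    def store(self, message: bytes) -> None:
        """Keep a message while the proxy is unavailable."""
        self._pending.append(message)

    def backfill_tick(self, send: Callable[[bytes], bool]) -> int:
        """Replay at most `max_per_tick` messages; return how many were sent.

        `send` returns True on success. On failure the message stays at
        the head of the queue and the tick ends early.
        """
        sent = 0
        while self._pending and sent < self._max_per_tick:
            if not send(self._pending[0]):
                break  # proxy still unhealthy; retry on the next tick
            self._pending.popleft()
            sent += 1
        return sent

    def __len__(self) -> int:
        return len(self._pending)


if __name__ == "__main__":
    buf = LocalBuffer(max_per_tick=2)
    for i in range(5):
        buf.store(bytes([i]))
    print(buf.backfill_tick(lambda m: True), len(buf))
```

Calling `backfill_tick` on a fixed schedule bounds the replay rate, which is the property that keeps a recovering proxy from being overwhelmed.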
[Mingmin]
[Mingmin]
And here we built this pipeline to address those requirements.
Basically, in each data center, there is a regional Kafka cluster.
In front of it, we set up the Kafka REST proxy, which is essentially a web service.
Applications use the proxy client to publish data to Kafka.
At the other end, we have the aggregate Kafka cluster.
uReplicator copies data from multiple regional clusters into the aggregate cluster.
Besides, LocalAgent and SecondaryKafka are used for fault tolerance.
[Mingmin]
So why build it? Why not publish to Kafka directly?
First of all, it simplifies the implementation of the client library
and therefore makes it feasible to support multiple languages.
The Kafka protocol is not well documented and is hard to implement.
But with the REST proxy, the client library is essentially an HTTP client.
Secondly, it decouples clients from the Kafka cluster. This makes
Kafka maintenance easier to conduct and transparent to end users.
What’s more, the number of connections to Kafka brokers is reduced a lot.
Besides, we have built quota management into the REST proxy to ensure
an abnormal producer won’t affect the normal ones.
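One simple way to picture per-producer quota management is a fixed-window counter, sketched below. The talk does not say which algorithm the REST proxy uses, so the `ProducerQuota` class, its per-window limit, and the explicit `reset_window` call are all assumptions made for illustration.

```python
from typing import Dict


class ProducerQuota:
    """Hypothetical fixed-window quota: each producer gets a message budget."""

    def __init__(self, limit_per_window: int) -> None:
        self._limit = limit_per_window
        self._used: Dict[str, int] = {}

    def try_acquire(self, producer_id: str, n_messages: int = 1) -> bool:
        """Admit the request only if the producer stays within its budget."""
        used = self._used.get(producer_id, 0)
        if used + n_messages > self._limit:
            return False  # over quota; other producers are unaffected
        self._used[producer_id] = used + n_messages
        return True

    def reset_window(self) -> None:
        """Start a new window; a scheduler would call this periodically."""
        self._used.clear()


if __name__ == "__main__":
    quota = ProducerQuota(limit_per_window=3)
    print(quota.try_acquire("noisy", 3), quota.try_acquire("noisy"), quota.try_acquire("quiet"))
```

Because the budget is tracked per producer, a producer that exhausts its quota is rejected without reducing what the others may send.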
[Mingmin]
[Mingmin]
The regional clusters are just regular Kafka clusters, but we also have a secondary cluster in each DC, which guarantees HA when the regional cluster is unavailable.
[Mingmin]
[Mingmin]
uReplicator copies data from multiple regional clusters into the aggregate cluster.
It is a replacement for the open source MirrorMaker.
[Mingmin]
Copies thousands of topics between clusters.
Why did we build it?
Long rebalance times, up to 20 minutes.
Apache Helix lets us embed customized balancing logic in case certain workers are heavily loaded.
[Mingmin]
[Mingmin]
[Mingmin]
Most of our Kafka clusters are tuned for high throughput using batching and async techniques.
By tuning the configuration and patching a few parts of the pipeline,
the data can be shipped over without any loss.
[Mingmin]
[George]
Consumers may consume from two different places:
Regional Kafka clusters
Global Aggregate Cluster to see a global view of data
[George]
[George]
[George]
Chaperone is embedded in or deployed alongside every component in the pipeline to count every message that flows through it.
The audit results are stored in Cassandra so that users can query them to check whether there is message loss or delay.
In Chaperone, the different kinds of components are called tiers, like Rest_proxy_tier, regional_tier, or aggregate tier.
The REST proxy and client libraries publish counts to the Chaperone Web Service.
Chaperone then consumes from the Kafka tiers and finally generates a per-topic report on the amount of data in each tier during a given 10-minute window.
If counts during a window differ by more than 0.01% (i.e. completeness falls below 99.99%), an alert is triggered.
[George]
If there is no loss, the message count should be the same at each tier.
If there is loss, the gap in the figure shows when the loss happened and by how much.
(For example, 10 messages are generated between 11:00am and 11:10am.
When those 10 messages arrive at the regional broker, an audit message saying that
10 messages generated in this 10-minute window have arrived at the regional broker
can be generated and stored in the database.
So we can check whether the 10 messages generated in this window have reached all components.)
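The window-based comparison described in these notes can be sketched as follows. It is a simplified illustration, not Chaperone's code: the function names are hypothetical, timestamps are assumed to be integer seconds taken from the message itself, and only two tiers are compared.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

WINDOW_SECONDS = 600  # 10-minute audit window, as described in the notes


def window_start(timestamp: int) -> int:
    """Round a timestamp (in seconds) down to the start of its window."""
    return timestamp - timestamp % WINDOW_SECONDS


def count_by_window(timestamps: Iterable[int]) -> Counter:
    """Count messages per window, keyed by window start."""
    return Counter(window_start(ts) for ts in timestamps)


def incomplete_windows(
    upstream: Dict[int, int],
    downstream: Dict[int, int],
    tolerance: float = 0.0001,  # 0.01%, i.e. 99.99% completeness
) -> List[Tuple[int, float]]:
    """Return (window start, missing fraction) for windows that should alert."""
    alerts = []
    for start, expected in sorted(upstream.items()):
        if expected == 0:
            continue
        missing = (expected - downstream.get(start, 0)) / expected
        if missing > tolerance:
            alerts.append((start, missing))
    return alerts


if __name__ == "__main__":
    proxy_tier = count_by_window([0, 10, 599, 600, 700])
    regional_tier = count_by_window([0, 10, 599, 600])
    print(incomplete_windows(proxy_tier, regional_tier))
```

Bucketing by the message's own timestamp rather than arrival time is what lets the same window be compared across tiers even when delivery is delayed.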
[George]
Besides, Chaperone tracks msg latency and msg rate.