What do you really know about how to monitor a Kafka cluster for problems? Is your most reliable monitoring your users telling you there’s something broken? Are you capturing more metrics than the actual data being produced? Sure, we all know how to monitor disk and network, but when it comes to the state of the brokers, many of us are still unsure of which metrics we should be watching, and what their patterns mean for the state of the cluster. Kafka has hundreds of measurements, from the high-level numbers that are often meaningless to the per-partition metrics that stack up by the thousands as our data grows.
We will thoroughly explore three key monitoring concepts in the broker that will leave you an expert at identifying problems with the least amount of pain:
Under-replicated Partitions: The mother of all metrics
Request Latencies: Why your users complain
Thread pool utilization: How could 80% be a problem?
We will also discuss the necessity of availability monitoring and how to use it to get a true picture of what your users see, before they come beating down your door!
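As a companion to the three metrics above, here is a minimal Java sketch of polling them over JMX. The hostname and JMX port are placeholders; the MBean names are the standard Kafka broker metric names.

```java
// Poll the three broker health metrics discussed above over JMX.
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerHealthCheck {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://broker1.example.com:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();

            // Under-replicated partitions: anything above zero needs attention.
            Object urp = mbs.getAttribute(new ObjectName(
                "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"),
                "Value");

            // 99th percentile total time for produce requests, in milliseconds.
            Object produceLatency = mbs.getAttribute(new ObjectName(
                "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce"),
                "99thPercentile");

            // Request handler pool idle ratio: low values mean the pool is saturated.
            Object handlerIdle = mbs.getAttribute(new ObjectName(
                "kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent"),
                "OneMinuteRate");

            System.out.printf("URP=%s produce p99=%sms handler idle=%s%n",
                urp, produceLatency, handlerIdle);
        }
    }
}
```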
Producer Performance Tuning for Apache Kafka - Jiangjie Qin
Kafka is well known for high-throughput ingestion. However, to get the best latency characteristics without compromising on throughput and durability, we need to tune Kafka. In this talk, we share our experiences in achieving the optimal combination of latency, throughput, and durability for different scenarios.
Like many other messaging systems, Kafka puts a limit on the maximum message size. Users will fail to produce a message if it is too large. This limit makes a lot of sense, and people usually send Kafka a reference link that points to a large message stored somewhere else. However, in some scenarios, it would be good to be able to send messages through Kafka without external storage. At LinkedIn, we have a few use cases that can benefit from such a feature. This talk covers our solution for sending large messages through Kafka without additional storage.
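To make the idea concrete, here is a rough Java sketch of the chunking approach: splitting one large payload into segments small enough to pass the size limit and tagging them with headers for reassembly. This is an illustration only, not LinkedIn's actual implementation; the topic name and segment size are made up.

```java
// Split a large payload into segments under the broker's size limit,
// tagged with headers so a consumer can reassemble them.
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class ChunkedSender {
    static final int SEGMENT_BYTES = 900_000; // stay under the ~1 MB default limit

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", ByteArraySerializer.class.getName());
        props.put("value.serializer", ByteArraySerializer.class.getName());

        byte[] payload = new byte[5_000_000]; // a "large message" stand-in
        byte[] key = UUID.randomUUID().toString().getBytes(StandardCharsets.UTF_8);
        int total = (payload.length + SEGMENT_BYTES - 1) / SEGMENT_BYTES;

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < total; i++) {
                byte[] segment = Arrays.copyOfRange(payload, i * SEGMENT_BYTES,
                        Math.min((i + 1) * SEGMENT_BYTES, payload.length));
                // Same key keeps all segments in one partition, in order.
                ProducerRecord<byte[], byte[]> record =
                        new ProducerRecord<>("large-messages", key, segment);
                record.headers()
                      .add("segment-index", Integer.toString(i).getBytes(StandardCharsets.UTF_8))
                      .add("segment-count", Integer.toString(total).getBytes(StandardCharsets.UTF_8));
                producer.send(record);
            }
        }
    }
}
```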
In the last few years, Apache Kafka has been used extensively in enterprises for real-time data collecting, delivering, and processing. In this presentation, Jun Rao, Co-founder, Confluent, gives a deep dive on some of the key internals that help make Kafka popular.
- Companies like LinkedIn are now sending more than 1 trillion messages per day to Kafka. Learn about the underlying design in Kafka that leads to such high throughput.
- Many companies (e.g., financial institutions) are now storing mission critical data in Kafka. Learn how Kafka supports high availability and durability through its built-in replication mechanism.
- One common use case of Kafka is for propagating updatable database records. Learn how a unique feature called compaction in Apache Kafka is designed to solve this kind of problem more naturally.
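For reference, making a topic compacted is a one-time configuration choice, so that Kafka retains only the latest record per key (the natural fit for updatable database records). A small sketch with the Java AdminClient might look like this; the topic name and sizing are placeholders.

```java
// Create a compacted topic via the AdminClient.
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("db-records", 6, (short) 3)
                .configs(Map.of(
                    TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT,
                    // How dirty the log may get before compaction kicks in.
                    TopicConfig.MIN_CLEANABLE_DIRTY_RATIO_CONFIG, "0.1"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```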
Cruise Control: Effortless management of Kafka clusters - Prateek Maheshwari
Kafka has become the de facto standard for streaming data with high throughput, low latency, and fault tolerance. However, its rising adoption raises new challenges. In particular, the growing cluster sizes, increasing volume and diversity of user traffic, and aging network and server components induce an overhead in managing the system. This overhead makes it infeasible for human operators to constantly monitor, identify, and mitigate issues. The resulting utilization imbalance across brokers leads to unpredictable client performance due to the high variation in their throughput and latency. Finally, properly expanding, shrinking, or upgrading clusters also incurs a management overhead. Hence, adopting a principled approach to managing Kafka clusters is integral to the sustainability of the infrastructure.
This talk will describe how LinkedIn alleviates the management overhead of large-scale Kafka clusters using Cruise Control. To this end, first, we will discuss the reactive and proactive techniques that Cruise Control uses to support admin operations for cluster maintenance, enable anomaly detection with self-healing, and provide real-time monitoring for Kafka clusters. Next, we will examine how Cruise Control performs in production. Finally, we will conclude with questions and further discussion.
Haitao Zhang, Uber, Software Engineer + Yang Yang, Uber, Senior Software Engineer
Kafka Consumer Proxy is a forwarding proxy that consumes messages from Kafka and dispatches them to a user-registered gRPC service endpoint. With Kafka Consumer Proxy, the experience of consuming messages from Apache Kafka for pub-sub use cases is as seamless and user-friendly as receiving (g)RPC requests. In this talk, we will share (1) the motivation for building this service, (2) the high-level architecture, (3) the mechanisms we designed to achieve high availability, scalability, and reliability, and (4) the current adoption status.
https://www.meetup.com/KafkaBayArea/events/273834934/
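As a rough illustration of the proxy pattern (not Uber's actual code), a minimal forwarding loop could poll Kafka and commit offsets only after a successful dispatch. The Dispatcher interface below is hypothetical and stands in for the user-registered gRPC endpoint.

```java
// Poll Kafka and push each record to a registered callback endpoint.
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class ConsumerProxySketch {
    interface Dispatcher { // placeholder for the user's gRPC service endpoint
        void dispatch(ConsumerRecord<byte[], byte[]> record) throws Exception;
    }

    public static void run(Dispatcher dispatcher) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "proxy-group");
        props.put("enable.auto.commit", "false"); // commit only after dispatch succeeds
        props.put("key.deserializer", ByteArrayDeserializer.class.getName());
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));
            while (true) {
                for (ConsumerRecord<byte[], byte[]> record :
                        consumer.poll(Duration.ofMillis(500))) {
                    try {
                        dispatcher.dispatch(record); // the (g)RPC call
                    } catch (Exception e) {
                        // A real proxy would retry or park the record in a
                        // dead-letter topic instead of dropping it.
                    }
                }
                consumer.commitSync();
            }
        }
    }
}
```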
Presentation at Strata Data Conference 2018, New York
The controller is the brain of Apache Kafka. A big part of what the controller does is to maintain the consistency of the replicas and determine which replica can be used to serve the clients, especially during individual broker failure.
Jun Rao outlines the main data flow in the controller—in particular, when a broker fails, how the controller automatically promotes another replica as the leader to serve the clients, and when a broker is started, how the controller resumes the replication pipeline in the restarted broker.
Jun then describes recent improvements to the controller that allow it to handle certain edge cases correctly and increase its performance, which allows for more partitions in a Kafka cluster.
Improving Kafka at-least-once performance at Uber - Ying Zheng
At Uber, we are seeing an increasing demand for Kafka at-least-once delivery (acks=all). So far, we have been running a dedicated at-least-once Kafka cluster with special settings. With a very low workload, the dedicated at-least-once cluster has been working well for more than a year. When we tried to allow at-least-once producing on the regular Kafka clusters, producing performance was the main concern. We spent some effort on this issue in recent months and managed to reduce at-least-once producer latency by about 80% with code changes and configuration tuning. When acks=0, these improvements also help increase Kafka throughput and reduce Kafka end-to-end latency.
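For context, the producer settings usually associated with at-least-once delivery look like the sketch below; the values are illustrative, not Uber's tuned configuration.

```java
// Producer configured for at-least-once delivery (acks=all).
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class AtLeastOnceProducer {
    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Wait for all in-sync replicas to acknowledge each write.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Retry transient failures rather than dropping the message.
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.toString(Integer.MAX_VALUE));
        // Idempotence prevents retries from producing duplicates.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        return new KafkaProducer<>(props);
    }
}
```

Note that acks=all only guarantees as much as the topic's min.insync.replicas setting allows, so durability is a joint producer/topic decision.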
PostgreSQL is a very popular and feature-rich DBMS. At the same time, PostgreSQL has a set of annoying, wicked problems that haven't been resolved in decades. Miraculously, with just a small patch to the PostgreSQL core extending its API, it appears possible to solve these wicked problems in a new engine built as an extension.
Speaker: Jun Rao, VP of Apache Kafka and Co-founder of Confluent
The controller is the brain of Apache Kafka®. A big part of what the controller does is to maintain the consistency of the replicas and determine which replica can be used to serve the clients, especially during individual broker failure.
In this talk, Jun will outline the main data flow in the controller—in particular, when a broker fails, how the controller automatically promotes another replica as the leader to serve the clients, and when a broker is started, how the controller resumes the replication pipeline in the restarted broker. Jun will then describe recent improvements to the controller that allow it to handle certain edge cases correctly and increase its performance, which allows for more partitions in a Kafka cluster.
Jun Rao is the co-founder of Confluent, a company that provides a streaming data platform on top of Apache Kafka. Previously, Jun was a senior staff engineer at LinkedIn, where he led the development of Kafka, and a researcher at IBM's Almaden Research Center, where he conducted research on database and distributed systems. Jun is the PMC chair of Apache Kafka and a committer on Apache Cassandra. He writes at https://cnfl.io/blog-jun-rao.
A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ... - HostedbyConfluent
Many organizations use Apache Kafka® to build data pipelines that span multiple geographically distributed data centers, for use cases ranging from high availability and disaster recovery, to data aggregation and regulatory compliance.
The journey from single-cluster deployments to multi-cluster deployments can be daunting, as you need to deal with networking configurations, security models and operational challenges. Geo-replication support for Kafka has come a long way, with both open-source and commercial solutions that support various replication topologies and disaster recovery strategies.
So, grab your towel, and join us on this journey as we look at tools, practices, and patterns that can help us build reliable, scalable, secure, global (if not inter-galactic) data pipelines that meet your business needs, and might even save the world from certain destruction.
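As a taste of what a minimal geo-replication setup involves, here is a sketch of a MirrorMaker 2 properties file for one-way replication; the cluster aliases and bootstrap addresses are placeholders.

```properties
# Minimal MirrorMaker 2 configuration sketch (connect-mirror-maker.properties)
# for primary -> backup replication.
clusters = primary, backup
primary.bootstrap.servers = kafka-primary:9092
backup.bootstrap.servers = kafka-backup:9092

# Replicate every topic and consumer group from primary into backup.
primary->backup.enabled = true
primary->backup.topics = .*
primary->backup.groups = .*

# Keep offsets translated so consumers can fail over to the backup cluster.
emit.checkpoints.enabled = true
sync.group.offsets.enabled = true
```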
Full recorded presentation at https://www.youtube.com/watch?v=2UfAgCSKPZo for Tetrate Tech Talks on 2022/05/13.
Envoy's support for the Kafka protocol, in the form of the broker filter and the mesh filter.
Contents:
- overview of Kafka (use cases, partitioning, producer/consumer, protocol);
- proxying Kafka (non-Envoy specific);
- proxying Kafka with Envoy;
- handling Kafka protocol in Envoy;
- Kafka-broker-filter for per-connection proxying;
- Kafka-mesh-filter to provide front proxy for multiple Kafka clusters.
References:
- https://adam-kotwasinski.medium.com/deploying-envoy-and-kafka-8aa7513ec0a0
- https://adam-kotwasinski.medium.com/kafka-mesh-filter-in-envoy-a70b3aefcdef
Big Data means big hardware, and the less of it we can use to do the job properly, the better the bottom line. Apache Kafka makes up the core of our data pipelines at many organizations, including LinkedIn, and we are on a perpetual quest to squeeze as much as we can out of our systems, from ZooKeeper, to the brokers, to the various client applications. This means we need to know how well the system is running, and only then can we start turning the knobs to optimize it. In this talk, we will explore how best to monitor Kafka and its clients to assure they are working well. Then we will dive into how to get the best performance from Kafka, including how to pick hardware and the effect of a variety of configurations in both the broker and clients. We'll also talk about setting up Kafka for no data loss.
Common issues with Apache Kafka® Producer - confluent
Badai Aqrandista, Confluent, Senior Technical Support Engineer
This session will be about a common issue in the Kafka Producer: producer batch expiry. We will discuss the Kafka Producer internals, the common causes of batch expiry, such as a slow network or small batches, and how to overcome them. We will also be sharing some examples along the way!
https://www.meetup.com/apache-kafka-sydney/events/279651982/
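For reference, the producer settings most relevant to batch expiry are sketched below: records sitting in the accumulator longer than delivery.timeout.ms are expired and fail. The values are illustrative only, not recommendations from the session.

```java
// Producer settings that govern batching and batch expiry.
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class BatchExpiryTuning {
    public static Properties props() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Larger batches drain the send queue more efficiently on a slow network.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "65536");
        // A small linger lets batches fill instead of sending tiny requests.
        props.put(ProducerConfig.LINGER_MS_CONFIG, "10");
        // Total time a record may spend between send() and success/failure;
        // raise this if transient slowness is expiring batches.
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "180000");
        // Per-request timeout; must be shorter than delivery.timeout.ms.
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, "30000");
        return props;
    }
}
```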
Understanding Metrics and Optimization Strategies for Apache Kafka Monitoring - SANG WON PARK
As Apache Kafka's role in big data architectures keeps growing and it takes on a more important share of the workload, concerns about its performance are growing too.
Across various projects, I have worked to understand the metrics needed to monitor Apache Kafka and have summarized the configuration settings used to optimize them.
[Understanding Metrics and Optimization Strategies for Apache Kafka Monitoring]
Covers the metrics needed for Apache Kafka performance monitoring and summarizes how to optimize performance from four perspectives (throughput, latency, durability, availability). For each of the three modules that make up Kafka (Producer, Broker, Consumer), performance optimization …
[Understanding Metrics for Apache Kafka Monitoring]
To monitor the state of Apache Kafka, you need to look at the metrics coming from four sources: the system (OS), the producers, the brokers, and the consumers.
This article organizes the producer/broker/consumer indicators around the JMX metrics exposed by the JVM.
It does not cover every metric; it focuses on the ones I found meaningful from my own perspective.
[Optimizing Apache Kafka Performance Configuration]
Performance goals are divided into four categories (throughput, latency, durability, availability), and for each goal I summarize which Kafka configuration settings to adjust and how.
After applying the tuned parameters, run performance tests and monitor the resulting metrics, iterating until the setup is optimized for your own workload.
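Since the write-up centers on JMX metrics, here is a minimal Java sketch (an illustration, not from the original article) showing that the same client-side indicators can also be read in process via producer.metrics().

```java
// Dump the standard producer metrics that back the JMX beans.
import java.util.Map;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

public class ProducerMetricsDump {
    public static void dump(KafkaProducer<String, String> producer) {
        for (Map.Entry<MetricName, ? extends Metric> entry :
                producer.metrics().entrySet()) {
            MetricName name = entry.getKey();
            // record-queue-time-avg and request-latency-avg (latency) and
            // outgoing-byte-rate (throughput) are good first metrics to watch.
            if (name.group().equals("producer-metrics")) {
                System.out.printf("%s = %s%n", name.name(),
                        entry.getValue().metricValue());
            }
        }
    }
}
```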
Kafka's basic terminologies, its architecture, its protocol and how it works.
Kafka at scale: its caveats, guarantees, and the use cases it supports.
How we use it @ZaprMediaLabs.
Kafka is a high-throughput, fault-tolerant, scalable platform for building high-volume, near-real-time data pipelines. This presentation is about tuning Kafka pipelines for high performance.
Select configuration parameters and deployment topologies essential to achieving higher throughput and low latency across the pipeline are discussed, along with lessons learned in troubleshooting and optimizing a truly global data pipeline that replicates 100 GB of data in under 25 minutes.
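As one concrete example of such parameters, the consumer-side settings below trade a little latency for throughput by letting fetches fill up before returning; the values are illustrative, not the ones from this presentation.

```java
// Consumer settings biased toward throughput in a high-volume pipeline.
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class HighThroughputConsumerConfig {
    public static Properties props() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "pipeline");
        // Wait for at least 1 MB (or 500 ms) before returning a fetch,
        // so each network round trip carries more data.
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, "1048576");
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, "500");
        // Let each partition contribute bigger chunks per fetch.
        props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, "4194304");
        return props;
    }
}
```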
CDC Stream Processing with Apache Flink - Timo Walther
An instant world requires instant decisions at scale. This includes the ability to digest and react to changes in real-time. Thus, event logs such as Apache Kafka can be found in almost every architecture, while databases and similar systems still provide the foundation. Change Data Capture (CDC) has become popular for propagating changes. Nevertheless, integrating all these systems, which often have slightly different semantics, can be a challenge.
In this talk, we highlight what it means for Apache Flink to be a general data processor that acts as a data integration hub. Looking under the hood, we demonstrate Flink's SQL engine as a changelog processor that ships with an ecosystem tailored to processing CDC data and maintaining materialized views. We will discuss the semantics of different data sources and how to perform joins or stream enrichment between them. This talk illustrates how Flink can be used with systems such as Kafka (for upsert logging), Debezium, JDBC, and others.
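A minimal sketch of that idea in Flink SQL (wrapped in the Java Table API) is shown below: a Debezium changelog from Kafka is read as a changing table, and a materialized view is maintained in an upsert-kafka topic. The topic names, schema, and broker address are assumptions.

```java
// Read a Debezium CDC stream from Kafka and maintain a materialized view.
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class CdcChangelogJob {
    public static void main(String[] args) {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Source: a Debezium changelog topic interpreted as a changing table.
        tEnv.executeSql(
            "CREATE TABLE orders (" +
            "  order_id BIGINT, customer STRING, amount DECIMAL(10, 2)" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'db.orders'," +
            "  'properties.bootstrap.servers' = 'localhost:9092'," +
            "  'format' = 'debezium-json'," +
            "  'scan.startup.mode' = 'earliest-offset')");

        // Sink: per-customer totals kept up to date as upserts.
        tEnv.executeSql(
            "CREATE TABLE customer_totals (" +
            "  customer STRING, total DECIMAL(10, 2)," +
            "  PRIMARY KEY (customer) NOT ENFORCED" +
            ") WITH (" +
            "  'connector' = 'upsert-kafka'," +
            "  'topic' = 'customer_totals'," +
            "  'properties.bootstrap.servers' = 'localhost:9092'," +
            "  'key.format' = 'json'," +
            "  'value.format' = 'json')");

        tEnv.executeSql(
            "INSERT INTO customer_totals " +
            "SELECT customer, SUM(amount) FROM orders GROUP BY customer");
    }
}
```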
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su... - DataStax
Deleting data from Cassandra has several challenges, and existing solutions (tombstones or TTLs) have limitations that make them unusable or untenable in certain circumstances. We'll explore the cases where existing deletion options fail or are inadequate, then describe a solution we developed which deletes data from Cassandra during standard or user-defined compaction, but without resorting to tombstones or TTL's.
About the Speaker
Eric Stevens Principal Architect, ProtectWise, Inc.
Eric is the principal architect, and day one employee of ProtectWise, Inc., specializing in massive real time processing and scalability problems. The team at ProtectWise processes, analyzes, optimizes, indexes, and stores billions of network packets each second. They look for threats in real time, but also store full fidelity network data (including PCAP), and when new security intelligence is received, automatically replay existing network history through that new intelligence.
One sink to rule them all: Introducing the new Async Sink - Flink Forward
Flink Forward San Francisco 2022.
Next time you want to integrate with a new destination for a demo, concept, or production application, the Async Sink framework will bootstrap development, allowing you to move quickly without compromise. In Flink 1.15 we introduced the Async Sink base (FLIP-171), with the goal to encapsulate common logic and allow developers to focus on the key integration code. The new framework handles things like request batching, buffering records, applying backpressure, retry strategies, and at-least-once semantics. It allows you to focus on your business logic, rather than spending time integrating with your downstream consumers. During the session we will dive deep into the internals to uncover how it works, why it was designed this way, and how to use it. We will code up a new sink from scratch and demonstrate how to quickly push data to a destination. At the end of this talk you will be ready to start implementing your own Flink sink using the new Async Sink framework.
by Steffen Hausmann & Danny Cranmer
Implementing End-To-End Tracing With Roman Kolesnev and Antony Stubbs | Current 2022 - HostedbyConfluent
Can you answer how a given event came to be? Is it an aggregation, a combination of multiple events with different sources? What are its origins?
Given the growing complexity of event streaming architectures - stateful processing, joins, fan-outs, multi-cluster flows - it is increasingly important to be able to accurately answer those questions, understand data flows and capture data provenance.
This talk will walk through how to use and extend OpenTelemetry Java agent auto instrumentation to achieve full end-to-end traceability in Kafka event streaming architectures involving multi-cluster deployments, the Connect platform, stateful KStream applications and ksqlDB workloads.
We will cover:
- Distributed Tracing concepts - context propagation and the OpenTelemetry implementation stack;
- Java agent auto instrumentation, problems faced when instrumenting service platforms (Connect and ksqlDB), stateful applications (KStreams and ksqlDB) and how auto instrumentation can be extended using loadable extensions to solve those problems;
- Demo of an end-to-end tracing implementation and a highlight of the interesting use cases it enables.
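To ground the context-propagation concept, here is a hedged sketch of doing the injection by hand with the OpenTelemetry API, which is essentially what the Java agent's Kafka instrumentation automates; the topic and tracer name are arbitrary.

```java
// Inject the current span's W3C trace context into Kafka record headers
// so the consumer can continue the same trace.
import java.nio.charset.StandardCharsets;
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.api.trace.propagation.W3CTraceContextPropagator;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TracedSend {
    public static ProducerRecord<String, String> traced(String value) {
        Tracer tracer = GlobalOpenTelemetry.getTracer("example");
        Span span = tracer.spanBuilder("events publish").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            ProducerRecord<String, String> record =
                new ProducerRecord<>("events", value);
            // Writes traceparent/tracestate headers onto the record.
            W3CTraceContextPropagator.getInstance().inject(
                Context.current(), record,
                (rec, key, val) -> rec.headers().add(
                    key, val.getBytes(StandardCharsets.UTF_8)));
            return record;
        } finally {
            span.end();
        }
    }
}
```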
Alexander Sapin from Yandex presents the reasoning, design considerations, and implementation of ClickHouse Keeper, which replaces ZooKeeper in ClickHouse clusters and thereby simplifies operation enormously.
Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent) - Ontico
HighLoad++ 2017
Delhi + Calcutta hall, November 8, 17:00
Abstract:
http://www.highload.ru/2017/abstracts/2978.html
When you are running systems in production, clearly you want to make sure they are up and running at all times. But in a distributed system such as Apache Kafka… what does “up and running” even mean?
...
Metrics Are Not Enough: Monitoring Apache Kafka and Streaming Applications - confluent
When you are running systems in production, clearly you want to make sure they are up and running at all times. But in a distributed system such as Apache Kafka… what does “up and running” even mean?
Experienced Apache Kafka users know what is important to monitor, which alerts are critical and how to respond to them. They don’t just collect metrics - they go the extra mile and use additional tools to validate availability and performance on both the Kafka cluster and their entire data pipelines.
In this presentation we’ll discuss best practices of monitoring Apache Kafka. We’ll look at which metrics are critical to alert on, which are useful in troubleshooting and what may actually be misleading. We’ll review a few “worst practices” - common mistakes that you should avoid. We’ll then look at what metrics don’t tell you - and how to cover those essential gaps.
Apache Kafka lies at the heart of the largest data pipelines, handling trillions of messages and petabytes of data every day. Learn the right approach for getting the most out of Kafka from the experts at LinkedIn and Confluent. Todd Palino and Gwen Shapira demonstrate how to monitor, optimize, and troubleshoot performance of your data pipelines—from producer to consumer, development to production—as they explore some of the common problems that Kafka developers and administrators encounter when they take Apache Kafka from a proof of concept to production usage. Too often, systems are overprovisioned and underutilized and still have trouble meeting reasonable performance agreements.
Topics include:
- What latencies and throughputs you should expect from Kafka
- How to select hardware and size components
- What you should be monitoring
- Design patterns and antipatterns for client applications
- How to go about diagnosing performance bottlenecks
- Which configurations to examine and which ones to avoid
Monitoring Apache Kafka
When you are running systems in production, clearly you want to make sure they are up and running at all times. But in a distributed system such as Apache Kafka… what does “up and running” even mean?
Experienced Apache Kafka users know what is important to monitor, which alerts are critical and how to respond to them. They don’t just collect metrics - they go the extra mile and use additional tools to validate availability and performance on both the Kafka cluster and their entire data pipelines.
In this presentation, we’ll discuss best practices of monitoring Apache Kafka. We’ll look at which metrics are critical to alert on, which are useful in troubleshooting and what may actually be misleading. We’ll review a few “worst practices” - common mistakes that you should avoid. We’ll then look at what metrics don’t tell you - and how to cover those essential gaps.
Resilience Planning & How the Empire Strikes Back - C4Media
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1pGpnbd.
Bhakti Mehta covers best practices for building resilient, stable, and predictable services: preventing cascading failures, the timeout pattern, the retry pattern, circuit breakers, and other techniques that have been used pervasively at Blue Jeans Network. Filmed at qconsf.com.
Bhakti Mehta is the author of "RESTful Java Patterns and Best practices” and "Developing RESTful Services with JAX-RS 2.0, WebSockets, and JSON”. Bhakti is a Senior Software Engineer at Blue Jeans Network. As part of her current role, she works on developing RESTful services that can be consumed by ISV partners and the developer community.
Make It Cooler: Using Decentralized Version Control - indiver
A commonly used version control system in the ColdFusion community is Subversion -- a centralized system that relies on being connected to a central server. The next generation version control systems are “decentralized”, in that version control tasks do not rely on a central server.
Decentralized version control systems are more efficient and offer a more practical way to develop software.
In this session, Indy takes you through the considerations in moving from Subversion to Git, a decentralized version control system. You also get to understand the pros and cons of each and hear of the practical experience of migrating projects to decentralized version control.
Version control is often used in conjunction with a testing framework and continuous integration. To complete the picture, Indy walks you through how to integrate Git with a testing framework, MXUnit, and a continuous integration server, Hudson.
A talk on production-ready microservices at scale in Ruby. It covers a production-readiness checklist for building microservices that are stable, reliable, secure, fault tolerant, and prepared for catastrophe.
Benchmarking NGINX for Accuracy and Results - NGINX, Inc.
View full webinar on demand at http://bit.ly/nginxbenchmarking
Whether you’re doing performance testing or planning for infrastructure needs, benchmarking can be a big deal. Join us for this webinar where we cover NGINX benchmarking best practices, including:
- the test environment
- configuring NGINX
- using benchmarking tools
- and more!
You’ll learn how to approach benchmarking so that you obtain results that are more accurate, better understood, and better suited to the needs of your project.
Cassandra is pretty awesome, sure I am biased, but it rocks. Always on, tuneable consistency and multi-master architecture? Let’s get our web scale on and build a highly available app that never goes down!
Hold on a second. There is one key piece of the puzzle that has a massive impact on your application’s availability: the client driver.
In this talk we will go through how to best configure your clients to make the most of failure handling and tuneable consistency in Cassandra.
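As a concrete illustration (DataStax Java driver 4.x, with a made-up keyspace and query), consistency can be tuned per statement like this:

```java
// Choose a consistency level per statement with the DataStax Java driver.
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class TunableReads {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // LOCAL_QUORUM tolerates a replica being down in the local DC
            // while still pairing with QUORUM writes for strong reads.
            SimpleStatement read = SimpleStatement.builder(
                    "SELECT * FROM shop.users WHERE id = 42")
                .setConsistencyLevel(DefaultConsistencyLevel.LOCAL_QUORUM)
                .build();
            session.execute(read).forEach(row -> System.out.println(row));
        }
    }
}
```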
Adding Real-time Features to PHP Applications - Ronny López
It's possible to introduce real-time features to PHP applications without deep modifications of the current codebase.
Using WAMP you can build distributed systems out of application components which are loosely coupled and communicate in (soft) real-time.
There is no need to learn a whole new language, with the implications it has.
It also opens the door to write reactive, event-based, distributed architectures and to achieve easier scalability by distributing messages to multiple systems.
Expect the unexpected: Prepare for failures in microservices - Bhakti Mehta
My talk at Confoo 2016 Montreal
It is well said that "The more you sweat on the field, the less you bleed in war". Failures are an inevitable part of complex systems. Accepting that failures happen will help you design the system's reactions to specific failures.
This talk covers best practices for building resilient, stable, and predictable services: preventing cascading failures, the timeout pattern, the retry pattern, circuit breakers, and many more techniques in microservices.
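As a tiny illustration of two of those patterns, the sketch below wraps a remote call in a timeout and a bounded retry loop with backoff; the thresholds are arbitrary, and production code would more likely reach for a library such as Resilience4j.

```java
// Timeout pattern plus retry pattern around a remote call.
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class RetryWithTimeout {
    public static <T> T call(Supplier<T> remote, int maxAttempts, Duration timeout)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // Timeout pattern: never wait on a dependency forever.
                return CompletableFuture.supplyAsync(remote)
                        .get(timeout.toMillis(), TimeUnit.MILLISECONDS);
            } catch (Exception e) {
                last = e;
                // Retry pattern: back off so we don't hammer a sick service.
                Thread.sleep(100L * attempt);
            }
        }
        throw last; // give up; a circuit breaker would now open
    }
}
```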
Design Review Best Practices - SREcon 2014 - Mandi Walls
Design reviews are the foundation for a successful product or feature launch. In this session we will broach a few of the critical questions an SRE asks during the design review process to ensure the design and deployment will result in a sustainable system. We will cover real world examples of the pitfalls of not engaging the operations/infrastructure team early in the process.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1vfO62b.
Lisa Van Gelder provides simple tips and tricks for improving delivery without investing lots of time up front creating complex deployment frameworks. Filmed at qconsf.com.
Lisa Van Gelder is a Senior Consultant at Cyrus Innovation where she works with companies to build and deliver software solutions, improve their software development process, and speed up delivery.
This tutorial gives a brief and interesting introduction to modern stream computing technologies. Participants will learn the essential concepts and methodologies for designing and building an advanced stream processing system. The tutorial unveils the key fundamentals behind various design choices, and the last section offers some forecasts of technology developments in this domain.
Leading Without Managing: Becoming an SRE Technical Leader - Todd Palino
Increasingly, technical organizations are developing career paths to build and recognize leaders outside of the traditional management roles. But what should an SRE who wants to be a leader be focusing on? Through the eyes of an engineer who reinvented his career in one of the largest SRE organizations, we will examine what technical leadership looks like, and how an individual can help guide the strategic path of a team, department, or company without taking on the role of a people manager. You'll pick up tactical work that you can start immediately to set yourself up for success, and some pointers to be able to identify the opportunities when they show up.
From Operations to Site Reliability in Five Easy Steps - Todd Palino
Across industries, modern operations teams have noted the emergence of a new role: the Site Reliability Engineer (SRE): an IT craftsperson who fuses software engineering and operations best practices to enable highly reliable software systems. Once the domain of technology giants, this discipline is both applicable and important for any organization looking to differentiate itself in a world increasingly defined by software.
In this session, Todd Palino from LinkedIn explores how SRE evolves from Operations by taking the ‘lid-off’ SRE at LinkedIn. He’ll describe how by crafting automation, problem solving, and building a partnership with software engineering teams, companies can build a high-trust and inclusive team culture that is needed to drive continuous improvement — and importantly, have lots of fun doing it!
Code Yellow: Helping Operations Top-Heavy Teams the Smart Way - Todd Palino
All engineering teams run into trouble from time to time. Alert fatigue, caused by technical debt or a failure to plan for growth, can quickly burn out SREs, overloading both development and operations with reactive work. Layer in the potential for communication problems between teams, and we can find ourselves in a place so troublesome we cannot easily see a path out. At times like this, our natural instinct as reliability engineers is to double down and fight through the issues. Often, however, we need to step back, assess the situation, and ask for help to put the team back on the road to success.
We will look at the process for Code Yellow, the term we use for this process of “righting the ship”, and discuss how to identify teams that are struggling. Through a look at three separate experiences, we will examine some of the root causes, what steps were taken, and how the engineering organization as a whole supports the process.
Monitoring services is easy, right? Set up a notification that goes out when a certain number increases past a certain threshold to let you know that there’s a problem. But if that’s the case, why are so many teams drowning in alerts and dreading their time on call? The reason is that we tend to monitor the wrong things: reactive alerts, metrics whose impact on our service we don’t completely understand, and capacity alerts. We look at our own view of the service and fail to consider that our customers have a different view.
Come learn to let go of what does not help, and explore how to monitor for what truly matters: what the customer sees. This starts with defining our agreements with our customers, continues through building applications intelligently and instrumenting all the things, and finishes with picking the right signals out of that instrumentation to generate alerts that are actionable, not ones that introduce confusion and noise. We will also touch on capacity planning, and how it should never wake you up. You’ll find it’s possible to assure that you meet your service level objectives while still maximizing your sleep level objectives.
Redefine Operations in a DevOps World: The New Role for Site Reliability Eng... - Todd Palino
Across industries, modern operations teams have noted the emergence of a new role: the Site Reliability Engineer (SRE); a new IT craftsperson who fuses software engineering and operations best practices to enable highly reliable software systems. Once the domain of web-scale businesses, this discipline is both applicable and important for any organization looking to differentiate itself in a world increasingly defined by software.
In this session, Todd Palino from LinkedIn explores SRE from organizational, team and individual perspectives. He’ll describe how by crafting automation and problem solving, SRE can permeate across a technical organization – not only ensuring a massively high-performant and always available site, but used to inform optimum decision making - in everything from system procurement to application design, builds and deployment.
Todd will talk in depth about what constitutes the best in SRE in a DevOps world, using examples to examine the techniques needed to accelerate value and grow teams. Taking the ‘lid-off’ SRE at LinkedIn, join Todd as he describes how it started and continues to evolve, what goals are important, and how it’s instrumental in building a high-trust and inclusive team culture needed to drive continuous improvement -- and importantly, have lots of fun doing it!
Kafka makes so many things easier to do, from managing metrics to processing streams of data. Yet it seems that so many things we have done to this point in configuring and managing it have been object lessons in how to make our lives, as the plumbers who keep the data flowing, more difficult than they have to be. What are some of our favorites?
* Kafka without access controls
* Multitenant clusters with no capacity controls
* Worrying about message schemas
* MirrorMaker inefficiencies
* Hope and pray log compaction
* Configurations as shared secrets
* One-way upgrades
We’ve made a lot of progress over the last few years improving the situation, in part by focusing some of this incredibly talented community towards operational concerns. We’ll talk about the big mistakes you can avoid when setting up multi-tenant Kafka, and some that you still can’t. And we will talk about how to continue down the path of marrying the hot, new features with operational stability so we can all continue to come back here every year to talk about it.
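Two of those gaps, access controls and capacity controls, can at least be closed programmatically today. A hedged AdminClient sketch follows; the principal, topic, and quota values are placeholders.

```java
// Create a topic ACL and a client produce quota via the AdminClient.
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.common.acl.AccessControlEntry;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.quota.ClientQuotaAlteration;
import org.apache.kafka.common.quota.ClientQuotaEntity;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;

public class TenantGuardrails {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Access control: only alice may write to her topic.
            AclBinding allowWrite = new AclBinding(
                new ResourcePattern(ResourceType.TOPIC, "alice-events", PatternType.LITERAL),
                new AccessControlEntry("User:alice", "*",
                    AclOperation.WRITE, AclPermissionType.ALLOW));
            admin.createAcls(List.of(allowWrite)).all().get();

            // Capacity control: cap alice's producers at ~1 MB/s per broker.
            ClientQuotaEntity entity = new ClientQuotaEntity(
                Map.of(ClientQuotaEntity.USER, "alice"));
            admin.alterClientQuotas(List.of(new ClientQuotaAlteration(entity,
                List.of(new ClientQuotaAlteration.Op("producer_byte_rate", 1048576.0))))
            ).all().get();
        }
    }
}
```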
I'm No Hero: Full Stack Reliability at LinkedIn - Todd Palino
The operations engineer is often seen as the hero, toiling away late nights on call to keep the systems running through failures of hardware and of code. While developers try as hard as possible to move quickly and break things, we stand as the voice of reason urging caution. We’re the only ones who truly understand the systems, but you’ll rarely find documentation because it’s just too complex and changeable to write down. When we’re doing our jobs well, we’re unappreciated because nobody understands how difficult it is. When things break, everyone thinks we’re doing our jobs badly. These are not the things we aspire to.
At LinkedIn, Site Reliability Engineers are one layer in a stack that starts with the way we manage our code and basic hardware, and is built with common systems for application management, monitoring, and alerting. Each layer has its own specialist engineers, focused on making their piece as resilient as it can be and building it to integrate with the rest of the stack. This lets Software Engineers concentrate on developing their applications, without having to spend time building systems to build, package, and distribute their code. SREs can dedicate their time to integrating applications with the stack, architecting and scaling deployments, as well as developing tools and documentation to make the job easier. When the inevitable failure happens, many experts come together to quickly identify and resolve the problem and improve the entire stack for everyone.
Description:
Presentation at the International Industry-Academia Workshop on Cloud Reliability and Resilience. 7-8 November 2016, Berlin, Germany.
Organized by EIT Digital and Huawei GRC, Germany.
Twitter: @CloudRR2016
Multi-tier, multi-tenant, multi-problem Kafka - Todd Palino
At LinkedIn, the Kafka infrastructure is run as a service: the Streaming team develops and deploys Kafka, but is not the producer or consumer of the data that flows through it. With multiple datacenters, and numerous applications sharing these clusters, we have developed an architecture with multiple pipelines and multiple tiers. Most days, this works out well, but it has led to many interesting problems. Over the years we have worked to develop a number of solutions, most of them open source, to make it possible for us to reliably handle over a trillion messages a day.
Presented at Kafka Summit 2016
Operating out of multiple datacenters is a large part of most disaster recovery plans, but it brings extra complications to our data pipelines. Instead of having a straight path from front to back, it now has forks and dead ends and odd little use cases that don’t match up with a perfect view of the world. This talk will focus on how to best utilize Apache Kafka in this world, including basic architectures for multi-datacenter and multi-tier clusters. We will also touch on how to assure messages make it from producer to consumer, and how to monitor the entire ecosystem.
This presentation was given at the ApacheCon 2015 Kafka Meetup.
These slides go into some detail on how to tune and scale Kafka clusters and the components involved. The slides themselves are bullet points, and all the detail is in the slide notes, so please download the original presentation and review those.
Kafka at Scale: Multi-Tier Architectures - Todd Palino
This is a talk given at ApacheCon 2015
If data is the lifeblood of high technology, Apache Kafka is the circulatory system in use at LinkedIn. It is used for moving every type of data around between systems, and it touches virtually every server, every day. This can only be accomplished with multiple Kafka clusters, installed at several sites, and they must all work together to assure no message loss, and almost no message duplication. In this presentation, we will discuss the architectural choices behind how the clusters are deployed, and the tools and processes that have been developed to manage them. Todd Palino will also discuss some of the challenges of running Kafka at this scale, and how they are being addressed both operationally and in the Kafka development community.
Note: there is a significant amount of slide notes on each slide going into detail. Please make sure to check out the downloaded file to get the full content!
Kafka is a publish/subscribe messaging system that, while young, forms a vital core for data flow inside many organizations, including LinkedIn. We will discuss Kafka from an Operations point of view, including the use cases for Kafka and the tools LinkedIn has been developing to improve the management of deployed clusters. We'll also talk about some of the challenges of managing a multi-tenant data service and how to avoid getting woken up at 3 AM.
NOTE: I highly recommend viewing the original PPT. It has copious speaker notes for each slide, and the animations will actually work properly.
Hybrid optimization of pumped hydro system and solar - Engr. Abdul-Azeez.pdf - fxintegritypublishin
Advancements in technology unveil a myriad of electrical and electronic breakthroughs geared towards efficiently harnessing limited resources to meet human energy demands. The optimization of hybrid solar PV panels and pumped hydro energy supply systems plays a pivotal role in utilizing natural resources effectively. This initiative not only benefits humanity but also fosters environmental sustainability. The study investigated the design optimization of these hybrid systems, focusing on understanding solar radiation patterns, identifying geographical influences on solar radiation, formulating a mathematical model for system optimization, and determining the optimal configuration of PV panels and pumped hydro storage. Through a comparative analysis approach and eight weeks of data collection, the study addressed key research questions related to solar radiation patterns and optimal system design. The findings highlighted regions with heightened solar radiation levels, showcasing substantial potential for power generation and emphasizing the system's efficiency. Optimizing system design significantly boosted power generation, promoted renewable energy utilization, and enhanced energy storage capacity. The study underscored the benefits of optimizing hybrid solar PV panels and pumped hydro energy supply systems for sustainable energy usage. Optimizing the design of solar PV panels and pumped hydro energy supply systems as examined across diverse climatic conditions in a developing country, not only enhances power generation but also improves the integration of renewable energy sources and boosts energy storage capacities, particularly beneficial for less economically prosperous regions. Additionally, the study provides valuable insights for advancing energy research in economically viable areas. Recommendations included conducting site-specific assessments, utilizing advanced modeling tools, implementing regular maintenance protocols, and enhancing communication among system components.
Online aptitude test management system project report.pdf - Kamal Acharya
The purpose of on-line aptitude test system is to take online test in an efficient manner and no time wasting for checking the paper. The main objective of on-line aptitude test system is to efficiently evaluate the candidate thoroughly through a fully automated system that not only saves lot of time but also gives fast results. For students they give papers according to their convenience and time and there is no need of using extra thing like paper, pen etc. This can be used in educational institutions as well as in corporate world. Can be used anywhere any time as it is a web based application (user Location doesn’t matter). No restriction that examiner has to be present when the candidate takes the test.
Every time when lecturers/professors need to conduct examinations they have to sit down think about the questions and then create a whole new set of questions for each and every exam. In some cases the professor may want to give an open book online exam that is the student can take the exam any time anywhere, but the student might have to answer the questions in a limited time period. The professor may want to change the sequence of questions for every student. The problem that a student has is whenever a date for the exam is declared the student has to take it and there is no way he can take it at some other time. This project will create an interface for the examiner to create and store questions in a repository. It will also create an interface for the student to take examinations at his convenience and the questions and/or exams may be timed. Thereby creating an application which can be used by examiners and examinee’s simultaneously.
Examination System is very useful for Teachers/Professors. As in the teaching profession, you are responsible for writing question papers. In the conventional method, you write the question paper on paper, keep question papers separate from answers and all this information you have to keep in a locker to avoid unauthorized access. Using the Examination System you can create a question paper and everything will be written to a single exam file in encrypted format. You can set the General and Administrator password to avoid unauthorized access to your question paper. Every time you start the examination, the program shuffles all the questions and selects them randomly from the database, which reduces the chances of memorizing the questions.
6th International Conference on Machine Learning & Applications (CMLA 2024)ClaraZara1
6th International Conference on Machine Learning & Applications (CMLA 2024) will provide an excellent international forum for sharing knowledge and results in theory, methodology and applications of on Machine Learning & Applications.
Water billing management system project report.pdfKamal Acharya
Our project entitled “Water Billing Management System” aims is to generate Water bill with all the charges and penalty. Manual system that is employed is extremely laborious and quite inadequate. It only makes the process more difficult and hard.
The aim of our project is to develop a system that is meant to partially computerize the work performed in the Water Board like generating monthly Water bill, record of consuming unit of water, store record of the customer and previous unpaid record.
We used HTML/PHP as front end and MYSQL as back end for developing our project. HTML is primarily a visual design environment. We can create a android application by designing the form and that make up the user interface. Adding android application code to the form and the objects such as buttons and text boxes on them and adding any required support code in additional modular.
MySQL is free open source database that facilitates the effective management of the databases by connecting them to the software. It is a stable ,reliable and the powerful solution with the advanced features and advantages which are as follows: Data Security.MySQL is free open source database that facilitates the effective management of the databases by connecting them to the software.
Student information management system project report ii.pdfKamal Acharya
Our project explains about the student management. This project mainly explains the various actions related to student details. This project shows some ease in adding, editing and deleting the student details. It also provides a less time consuming process for viewing, adding, editing and deleting the marks of the students.
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsVictor Morales
K8sGPT is a tool that analyzes and diagnoses Kubernetes clusters. This presentation was used to share the requirements and dependencies to deploy K8sGPT in a local environment.
Cosmetic shop management system project report.pdfKamal Acharya
Buying new cosmetic products is difficult. It can even be scary for those who have sensitive skin and are prone to skin trouble. The information needed to alleviate this problem is on the back of each product, but it's thought to interpret those ingredient lists unless you have a background in chemistry.
Instead of buying and hoping for the best, we can use data science to help us predict which products may be good fits for us. It includes various function programs to do the above mentioned tasks.
Data file handling has been effectively used in the program.
The automated cosmetic shop management system should deal with the automation of general workflow and administration process of the shop. The main processes of the system focus on customer's request where the system is able to search the most appropriate products and deliver it to the customers. It should help the employees to quickly identify the list of cosmetic product that have reached the minimum quantity and also keep a track of expired date for each cosmetic product. It should help the employees to find the rack number in which the product is placed.It is also Faster and more efficient way.
We have compiled the most important slides from each speaker's presentation. This year’s compilation, available for free, captures the key insights and contributions shared during the DfMAy 2024 conference.
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Dr.Costas Sachpazis
Terzaghi's soil bearing capacity theory, developed by Karl Terzaghi, is a fundamental principle in geotechnical engineering used to determine the bearing capacity of shallow foundations. This theory provides a method to calculate the ultimate bearing capacity of soil, which is the maximum load per unit area that the soil can support without undergoing shear failure. The Calculation HTML Code included.
An Approach to Detecting Writing Styles Based on Clustering Techniquesambekarshweta25
An Approach to Detecting Writing Styles Based on Clustering Techniques
Authors:
-Devkinandan Jagtap
-Shweta Ambekar
-Harshit Singh
-Nakul Sharma (Assistant Professor)
Institution:
VIIT Pune, India
Abstract:
This paper proposes a system to differentiate between human-generated and AI-generated texts using stylometric analysis. The system analyzes text files and classifies writing styles by employing various clustering algorithms, such as k-means, k-means++, hierarchical, and DBSCAN. The effectiveness of these algorithms is measured using silhouette scores. The system successfully identifies distinct writing styles within documents, demonstrating its potential for plagiarism detection.
Introduction:
Stylometry, the study of linguistic and structural features in texts, is used for tasks like plagiarism detection, genre separation, and author verification. This paper leverages stylometric analysis to identify different writing styles and improve plagiarism detection methods.
Methodology:
The system includes data collection, preprocessing, feature extraction, dimensional reduction, machine learning models for clustering, and performance comparison using silhouette scores. Feature extraction focuses on lexical features, vocabulary richness, and readability scores. The study uses a small dataset of texts from various authors and employs algorithms like k-means, k-means++, hierarchical clustering, and DBSCAN for clustering.
Results:
Experiments show that the system effectively identifies writing styles, with silhouette scores indicating reasonable to strong clustering when k=2. As the number of clusters increases, the silhouette scores decrease, indicating a drop in accuracy. K-means and k-means++ perform similarly, while hierarchical clustering is less optimized.
Conclusion and Future Work:
The system works well for distinguishing writing styles with two clusters but becomes less accurate as the number of clusters increases. Future research could focus on adding more parameters and optimizing the methodology to improve accuracy with higher cluster values. This system can enhance existing plagiarism detection tools, especially in academic settings.
Using recycled concrete aggregates (RCA) for pavements is crucial to achieving sustainability. Implementing RCA for new pavement can minimize carbon footprint, conserve natural resources, reduce harmful emissions, and lower life cycle costs. Compared to natural aggregate (NA), RCA pavement has fewer comprehensive studies and sustainability assessments.
6. Monitoring is not Alerting
• Collect everything
• Alert on nothing
• Events are better than metrics
• Tests are better than alerts
• Sleep is best in life
7. Service Level Objectives
• What’s an SLA?
• Availability
• Latency
• Customer Guarantees
9. The Three Metrics You Need to Know
• URP – Partitions that are not fully replicated within the cluster
• Request Handlers – The overall utilization of an Apache Kafka broker
• Request Timing – How long requests are taking, and in which stage of processing
17. Request Handler Problems
CPU Time
• Anything that causes Kafka to expend CPU cycles
• Includes problems related to failing disks (IO wait)
• SSL and compression work can both use a lot of CPU
Timeout
• Most often due to failing to process controller requests
• Intra-cluster requests tend to be bound by partition counts
• Rapidly starves the pool of threads
Deadlock
• Should always be a code bug
• Usually looks exactly like a timeout problem
• Rare, but hard to identify
21. Brokers Don’t Shouldn’t Do Compression
Up Conversion
• Kafka brokers are running a new version
• Message format has been set to the new version
• Clients haven’t upgraded
Down Conversion
• Kafka brokers are running a new version
• Message format is set to an older version due to clients
• Producer clients update to new version
22. Request Timing
• Total – Request handling, end to end
• Request Queue – Waiting to process
• Local – Work local to the broker
• Remote – Waiting for other brokers
• Response Queue – Waiting to send
• Response Send – Send to client
29. Operating System and Hardware Metrics
• What do they mean?
• What application is causing it?
• Don’t alert unless:
• 100% clear signal
• 100% clear response
34. If You Remember Nothing Else…
• Define your service level objectives
• Monitor your service level objectives
• Metrics that cover many problems are noisy
• Buy Kafka: The Definitive Guide
35. Getting (and Giving) Help
LinkedIn Open Source
• Kafka Monitor – https://github.com/linkedin/kafka-monitor
• Burrow – https://github.com/linkedin/Burrow
• Cruise Control – https://github.com/linkedin/cruise-control
• kafka-tools – https://github.com/linkedin/kafka-tools
Get Involved
• Community: users@kafka.apache.org, dev@kafka.apache.org
• Bugs and Work: https://issues.apache.org/jira/projects/KAFKA
Let me start off by telling you what we’re not talking about today. I won’t be going into the basics of what Kafka is – I assume that if you’re attending Kafka Summit, you have an idea of what it does and how it works. Regardless, you’re going to get some good data here on monitoring, even if you have very limited Kafka knowledge.
However, this also won’t be an encyclopedic look at monitoring. I’m going to discuss a few key sets of metrics, and how to use them. But I won’t even be covering all the Kafka metrics you should look at, never mind all that exist. I encourage you to spin up a JMX tool of choice and explore what’s exposed for sensors in Kafka. I also encourage you to share with the class, whether in posts, talks, or tweets, any gems that you have for your own monitoring.
I’m also not going to talk about automation, even as it relates to handling alerts. There are many fine talks out there about automating responses and runbooks, and we could spend hours talking about just that.
So why am I here today talking about monitoring? There are lots of topics that could be covered, especially in an ecosystem as large as Kafka. And I could always deliver yet another “here’s how we do it at LinkedIn” talk. However, today I’m choosing to share a look at where we’re moving right now.
I recently wrote a post for DevOps.com about a term we use, “Code Yellow”. This is one of our tools for dealing with an application, or a team, in crisis. Typically this is due to something like communication problems, or a large amount of tech debt. Since I recently wrote this post, and you all know that I work on Kafka, you can probably guess that I’m currently in this state. In our case, it’s due to somewhat unexpected growth.
LinkedIn started using Kafka back in 2010, before it was open sourced. In September of 2015, we announced that we had hit a milestone, at one trillion messages a day produced into our Kafka clusters. Last year, at Kafka Summit in San Francisco, I noted that we had passed two trillion messages a day. At the beginning of the year, we clocked in at three trillion. And now, we’re over five trillion messages a day. That hockey stick at the end is the current source of my long days and sleepless nights.
Top this off with the fact that our monitoring is currently very noisy, partly due to scale problems around this growth, and partly because we alert on many things that are not providing clear signals. We’re currently overhauling our monitoring as a result of this.
So why do we have such noisy alerting? We’ve forgotten that monitoring and alerting are not the same thing.
Today, we're going to be talking about monitoring, not alerting. What is the difference, you ask? In our case, monitoring refers to all the data we have available to us from Kafka and our underlying systems, from high level metrics like partition counts down to the most minute sensor that is available. Alerting, on the other hand, we will use to refer to the metrics that are used to tell us about an imminent problem. They're the metrics that wake us up at night. These should be carefully chosen, and they should be clear signals that demand an immediate response 100% of the time.
Another thing to keep in mind is that events are almost always superior to metrics when alerting. We know this, right? Kafka is all about events. And yet we still have measurements that are rates where they should be discrete counts of events. We normally can’t work with individual events, like a failed request, at scale. But we do want to know the actual number of failed requests, and not a requests-per-second metric where we miss data due to time windows.
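If you want to see what that looks like in practice, here’s a minimal sketch of reading the cumulative event count instead of the windowed rate, assuming a broker with JMX exposed on localhost:9999 (the port is an assumption; use whatever your JMX_PORT is set to):

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class FailedRequestCount {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi"); // assumed JMX port
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName failedProduce = new ObjectName(
                "kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec");
            // The meter's "Count" attribute is a cumulative event count; alert on
            // deltas between scrapes rather than on the windowed "OneMinuteRate",
            // which can hide short bursts of failures.
            long count = (Long) mbs.getAttribute(failedProduce, "Count");
            System.out.println("Failed produce requests (cumulative): " + count);
        }
    }
}
```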
We also need to make sure that we’re testing the code before we deploy it. My team has fallen prey to reactive alerting – we find a new problem, like a socket leak, and we add a new alert for file handles in use so we can catch it before it goes critical. The bug gets fixed, but we keep the alert, just in case we run into it again. It would be much better for everyone if we added a release test that checks for the general case of increased file handle usage, and dropped the alert on the live systems.
Alerting should always be aimed at maximizing the amount of sleep that your operations team gets. That means as few alerts as possible to keep everything running, and automating as much as possible.
When we're talking about alerting, the most important thing to watch is the metrics related to your service level objectives, or SLOs. Just as a note, an SLO and an SLA are not the same thing. A service level agreement is a contract: it's basically an SLO with teeth - a penalty. The SLO is the level of service that we're promising to our customers. For Kafka, this is typically going to be that the system will be available, and it will perform at a certain level for produce and consume requests. We'll cover what metrics to use for this in a bit.
In addition to these, your SLOs are whatever you’re guaranteeing to your customers. This may include a minimum amount of retention. If you’re working to GDPR, or another privacy standard, you may specify a maximum amount of time that data will be retained for (here’s a hint, that’s not necessarily the retention in time that you set for the topic).
I've talked at length about the under-replicated partition count metric. I dedicated a significant number of pages in a book you may have seen to how to respond to any non-zero value. At its heart, this number tells you that replication within the cluster is having problems.
A stable count on all but one broker tells you that that broker is not working. It's either down, or its replication has not started.
A variable count on a single broker tells you that that broker is having a problem servicing consume requests
A variable count on multiple brokers indicates a more overall problem. In this case, you'll need to enumerate the partitions that are falling behind (using the CLI tools) and see if there is a common thread, such as a single broker that is having problems replicating from multiple cluster members.
But the most important thing about the URP metric is that it's overrated for alerting. That's right, I said it. I don't like getting paged for this metric. But why, you ask? If it illustrates so many problems, why wouldn't I want alerts for it? The problem is that it doesn't tell me that I'm breaching my SLO, and whatever problem it's telling me about is often not immediately actionable. More often than not, this metric tells me about one of two problems. The first is that a broker is down. I can detect that with a much clearer signal, however, by health checking the application. The other problem is that the cluster is operating over its capacity. I don't want to be paged for that either, because capacity is a proactive monitoring problem, not a reactive one. We'll talk about that more in a few slides.
Still, you should be collecting this metric, and you might want to consider generating warnings for it. It does illustrate a risky situation, because we depend on replication in the cluster for redundancy. When it's not zero, you have a problem that needs some attention.
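If you'd rather enumerate those lagging partitions programmatically than shell out to the CLI tools, here's a minimal sketch using the Java Admin client, assuming a reachable bootstrap server and a kafka-clients version recent enough to have allTopicNames():

```java
import java.util.Collection;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListTopicsOptions;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class UnderReplicatedLister {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            Collection<String> names =
                admin.listTopics(new ListTopicsOptions().listInternal(true))
                     .names().get();
            for (TopicDescription desc :
                     admin.describeTopics(names).allTopicNames().get().values()) {
                for (TopicPartitionInfo p : desc.partitions()) {
                    // A partition is under-replicated when its in-sync replica
                    // set is smaller than its full replica set.
                    if (p.isr().size() < p.replicas().size()) {
                        System.out.printf("%s-%d ISR=%s replicas=%s%n",
                            desc.name(), p.partition(), p.isr(), p.replicas());
                    }
                }
            }
        }
    }
}
```

Grouping the output by follower broker is a quick way to spot the common thread mentioned above, such as a single cluster member that is behind on everything.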
As with most applications, Kafka has thread pools to do work. There are several different ones - network handlers, request handlers, log compaction, recovery (which are also used for handling log segments at startup and shutdown). When we’re talking about client traffic, the network and request handlers are the ones that do all the work, and the request handlers are far more important. This is because the network handlers just take care of the network connection, including reading and writing bytes on the wire.
The request handler does everything else for the client - it decodes and validates the protocol, handles produce and consume work, and assembles the response to send back. It even performs all of the broker internal work, responding to controller requests. This means that if you want a single indicator of how busy the broker is, you couldn’t ask for a much better measure than the utilization of the request handlers. But as with under-replicated partitions, there are a lot of different problems that could be indicated here
CPU - Slow disk performance, often due to a failing drive, is a particular problem for produce requests. As the request handler will have to take more time when writing to disk, it will manifest as higher utilization
Timeouts and deadlocks look very similar
Timeouts - all of the request handler threads are getting tied up. We most often see this when the broker is starting up, and it is failing to process requests from the controller within the controller socket timeout.
Deadlock - If it looks like a timeout but the timeout explanation doesn't fit, you may have hit a deadlock condition in handling requests. We've seen this recently with some shutdown code, but it was related to the authorizer we were using and not Kafka directly.
Timeouts most often happen when controller requests are not processed within the controller socket timeout. What happens is that the controller sends the request, it times out, and then the controller sends the request again. You’ll see this especially when the broker is starting up, and the controller is trying to send it the state of the world with leader and ISR requests
Deadlocks look almost identical, but they’re much more rare. We’ve seen them recently during shutdown, but that was caused by an issue in the authorizer module that we use, and not something that was endemic to Kafka itself. However, they’re almost always code issues. This makes them pretty tricky to debug.
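Whatever the underlying cause, it shows up first in the request handler idle ratio. Here's a minimal sketch of turning the broker's RequestHandlerAvgIdlePercent gauge into a utilization number, again assuming JMX on localhost:9999 (as I understand it, the meter's one-minute rate reads as an idle fraction between 0 and 1):

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class HandlerIdleCheck {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi"); // assumed JMX port
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName pool = new ObjectName(
                "kafka.server:type=KafkaRequestHandlerPool,"
                + "name=RequestHandlerAvgIdlePercent");
            // 1.0 means fully idle, 0.0 means saturated; utilization is 1 - idle.
            double idle = (Double) mbs.getAttribute(pool, "OneMinuteRate");
            System.out.printf("Request handler utilization: %.0f%%%n",
                (1 - idle) * 100);
        }
    }
}
```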
Wait, the Kafka brokers don’t compress data anymore! We got rid of that with the bump to message format 1, and relative offsets in the produced batches. Right?
Yeah, that’s what I thought, too. Turns out that there are a couple cases, which are not as rare as you might think, that will result in the broker having to rewrite the incoming message batches.
Another common culprit for the request handlers being overutilized, even at low traffic volume, is compression. This happens when the client versions do not match the message format on disk. The message format is settable via a broker configuration, and controls how messages are written to disk. In an ideal world, the producer client version matches this configuration, so the producer is sending the same message format. If the producer is an older version, the broker has to up-convert the messages, and if the producer is using a newer message format version, the broker has to down-convert. Both of these situations mean the broker is forced to recompress the message batch before writing it to disk (this also happens if your brokers are still using message format zero). This is an expensive operation, and it should be avoided. It's also worth noting that you can set the message format on disk as a per-topic override. You will want to be very careful if you feel the need to do this, as it means the logs on disk are inconsistent, and you could easily have compression you're not expecting.
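You don't have to guess whether this is happening to you: newer broker versions expose message conversion meters under BrokerTopicMetrics. A minimal sketch, with the usual assumption of JMX on localhost:9999 (and assuming your broker version has these meters):

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ConversionCheck {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi"); // assumed JMX port
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            for (String name : new String[] {
                    "ProduceMessageConversionsPerSec",
                    "FetchMessageConversionsPerSec"}) {
                ObjectName meter = new ObjectName(
                    "kafka.server:type=BrokerTopicMetrics,name=" + name);
                // A non-zero cumulative count means the broker is rewriting
                // (and recompressing) batches for mismatched client versions.
                long count = (Long) mbs.getAttribute(meter, "Count");
                System.out.println(name + " count=" + count);
            }
        }
    }
}
```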
If you have slow request processing due to issues like this, you’re also going to have latency issues. Which gets us into the third set of metrics...
For each protocol request type, Kafka provides a set of timing metrics. These describe the amount of time that the request spends in various states while being processed:
Total time - this is the overall total time to process a request, from when it is received to when it is complete
Request Queue Time - how long the request sits in queue before being picked up by a request handler for processing
Local Time - The amount of local processing time required for the request. This can include a number of things, such as disk write time for produce requests
Remote Time - The amount of time that the request waits on non-local steps. This includes acknowledgements from followers for produce requests
Response Queue Time - how long the response for the request sits in queue before being sent to the client
Response Send Time - how long it takes to send the response to the client. This only covers getting it into the send buffers locally, not network time.
In addition to the time metrics, there is also a rate metric that gives you the number of requests of a particular type per second. The time metrics are provided as percentiles, and as such you can choose from 50th, 75th, 99th, and 99.9th percentiles, as well as an average and maximum value over the course of the running process.
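Pulling these out for a given request type is straightforward over JMX. Here's a minimal sketch for the Produce TotalTimeMs attributes, with the same localhost:9999 assumption as before:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ProduceTiming {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi"); // assumed JMX port
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName totalTime = new ObjectName(
                "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce");
            for (String attr : new String[] {
                    "Mean", "50thPercentile", "99thPercentile", "999thPercentile"}) {
                // Values are in milliseconds over the life of the process.
                double ms = ((Number) mbs.getAttribute(totalTime, attr)).doubleValue();
                System.out.printf("Produce TotalTimeMs %s = %.2f ms%n", attr, ms);
            }
        }
    }
}
```

The same pattern works for RequestQueueTimeMs, LocalTimeMs, RemoteTimeMs, ResponseQueueTimeMs, and ResponseSendTimeMs; note that the 99.9th percentile attribute is spelled 999thPercentile.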
Request latency is typically going to be the first of your SLO measurements. Which means that you will probably want to be monitoring these metrics and possibly alerting off them. The problem comes in as you try to pick which attributes to monitor, and what the baseline values are.
Here are the produce TotalTime graphs, 50th percentile and 99.9th percentile, for a broker that is working perfectly well. It may be hard to see, but the scale of the first graph is in single digits, and the scale of the second is in thousands. If the broker is running well, why is there such a discrepancy? The reason is that the amount of time required for a produce request varies widely depending on the content of the request.
Let’s consider the local time. Again, these are the 50th percentile and the 99.9th percentile, and the first graph goes from zero to one, while the second graph is again in the thousands. What would impact the amount of time required to process the produce requests locally? In this case, most of our produce requests are really small - small batches, single topic - but some of them are very large. The bigger the produce request, the more time it takes to write the data to disk.
How about the remote time for the same produce requests? Yet again, these are the 50th and 99.9th percentile graphs, with the first one going from zero to two, and the second being in the thousands. The average value is small, but the 99.9th percentile is multiple orders of magnitude higher. The most common cause here is that most of our requests are being produced with required acknowledgements set to 1, while some are requesting all acknowledgements. That easily drives up the amount of time spent in the remote step.
This isn’t to say that you can’t use these metrics effectively for alerting. It just means that you need to define your SLOs appropriately. Stating simply that produce requests will be handled in 20ms or less may not be reasonable, but specifying that value for the average produce request may be fine.
OK, so we’ve covered our three metrics, and we’ve still got X minutes left in this talk. I could sit here and just stare at my phone for the rest of the time. Or …
We could talk about what’s missing, since we only covered a very small slice of monitoring for Kafka.
The other side of your service level objectives is probably going to be the availability of Kafka to handle requests. But as with any system, you can’t truly measure the availability of a Kafka cluster from the brokers themselves. There are many factors that go into availability, including whether or not the network is working. Looking at the broker itself may tell you that everything’s fine, meanwhile none of your clients can connect.
For monitoring availability, you need to use something external to the Kafka cluster to look at it from the client’s point of view. This is why LinkedIn created, and open sourced, kafka-monitor (https://github.com/linkedin/kafka-monitor). This runs a producer and a consumer for each cluster, and assures that both requests work properly. It can assure that there is at least one partition on each broker in the cluster, so that you exercise the entire cluster. It also provides latency metrics for requests, so you have an objective view of the request timings we were just talking about.
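kafka-monitor is the production answer, but the core idea fits in a page. Here's a toy sketch of an external round-trip probe (not kafka-monitor's actual implementation); it assumes a pre-created single-partition topic named availability-probe, which is a hypothetical name, plus the standard Java clients:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class AvailabilityProbe {
    public static void main(String[] args) throws Exception {
        String topic = "availability-probe"; // hypothetical pre-created probe topic

        Properties prod = new Properties();
        prod.put("bootstrap.servers", "localhost:9092");
        prod.put("key.serializer", StringSerializer.class.getName());
        prod.put("value.serializer", StringSerializer.class.getName());
        prod.put("acks", "all");

        Properties cons = new Properties();
        cons.put("bootstrap.servers", "localhost:9092");
        cons.put("key.deserializer", StringDeserializer.class.getName());
        cons.put("value.deserializer", StringDeserializer.class.getName());

        long start = System.currentTimeMillis();
        String payload = "probe-" + start;
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(prod);
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cons)) {
            // Tail the probe partition from the end, then produce and time the trip.
            TopicPartition tp = new TopicPartition(topic, 0);
            consumer.assign(List.of(tp));
            consumer.seekToEnd(List.of(tp));
            consumer.position(tp); // force the seek to resolve before producing

            producer.send(new ProducerRecord<>(topic, 0, "probe", payload)).get();

            long deadline = System.currentTimeMillis() + 10_000;
            while (System.currentTimeMillis() < deadline) {
                ConsumerRecords<String, String> records =
                    consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    if (payload.equals(r.value())) {
                        System.out.println("Round trip ms: "
                            + (System.currentTimeMillis() - start));
                        return;
                    }
                }
            }
            System.out.println("Probe failed: no round trip within 10s");
        }
    }
}
```

A real probe would run continuously, cover a partition on every broker, and emit the round-trip latency as a metric rather than printing it.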
So what should we do about lower level OS and hardware metrics? Well, let me ask you this. I have a Kafka cluster that’s running at 95% CPU, what do I do? Well, if it’s serving requests properly and within the SLO, I go get a cup of coffee. I might need to look at it, but it’s not a crisis.
Most metrics, OS or otherwise, are a great recipe for creating lots of alert noise that is not actionable. CPU and memory usage could be high due to other applications, and in most cases relate to overall capacity and not to the application’s performance or current state of functionality. You should definitely collect them so that you can go back and debug problems later. If you’re thinking about setting up an alert you need to ask yourself two things:
Is this always actionable when the alert goes off?
Is the action 100% clear?
If the answer to either of these is something along the lines of “Yes, but…” you need to stop and rethink what you’re trying to accomplish. But, Todd! I need to monitor things like disk usage, don’t I? Yes, of course we do, but this falls under the heading of capacity planning.
My Kafka environment, like many of yours, is shared between many different applications. You may even have some of the tech debt that we have, where you have little control over when someone starts using it for a new service. This means that we should be keeping an eye on the capacity of the system, and preemptively adding more.
Preemptively is the key word here. You want to deploy new brokers before you’ve hit 100% capacity, which means that you need to order them earlier than that.
I am no magician, contrary to the perception that many have of my ability to solve problems. It does me no good to get an alarm in the middle of the night that we’re approaching saturation, as I can’t magically make new hardware appear. And if I already have the hardware, it should have been added to the clusters so that I never hit a crisis point.
The metrics that I’m mostly interested in for judging capacity are:
Request handler pool idle ratio
Disk utilization
Partition Count
Network utilization
You should be trending these metrics over time, and reviewing them on a regular basis. You may want to have some sort of alert once capacity is approaching the point where you need to get more, but that should be an email or, even better, an automatic work ticket in your system of choice. Additionally, make sure you’re making use of features like quotas and retention of messages by size, so that you can minimize any surprises.
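As one example of the retention-by-size piece, here's a minimal sketch using the Admin API to cap a topic's size; the topic name my-topic and the 10 GiB figure are placeholders:

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RetentionBySize {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            ConfigResource topic =
                new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
            // Cap each partition at ~10 GiB so a runaway producer cannot fill
            // the disk before time-based retention kicks in.
            AlterConfigOp op = new AlterConfigOp(
                new ConfigEntry("retention.bytes",
                    String.valueOf(10L * 1024 * 1024 * 1024)),
                AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> changes =
                Map.of(topic, List.of(op));
            admin.incrementalAlterConfigs(changes).all().get();
        }
    }
}
```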
If you take nothing else away from today’s talk, leave with this.
First, you must define what your service level objectives are for Kafka within your organization. Even if you’re running at a small scale, and with a limited number of customers. Even if you’re the only customer of your cluster. Make it clear what the expectations are, and hold to them.
Next, once you have those SLOs, that is what you need to be monitoring. David Henke, who led Engineering and Operations at LinkedIn for many years, would often say “What gets measured, gets fixed.” If you do not monitor your SLOs, then they do not really count.
But beware of metrics that inform you of many different problems. They are typically noisy, and they often make it difficult to determine what the underlying problem is. They are attractive because they offer a single number that says “something is wrong”, but they will drive you crazy in the end.
And lastly, buy yourself a copy of Kafka: The Definitive Guide. In fact, you should buy two or three. Because reasons.