SignalFx engineer Rajiv Kurian's presentation on why we wrote our own Kafka consumer, the performance goals, and the performance gains achieved.
Download the slides to see animations showing hardware details. These slides were converted from Keynote to PowerPoint, so there may be some oddness with slide transitions!
Scaling ingest pipelines with high performance computing principles - Rajiv K... - SignalFx
By Rajiv Kurian, software engineer at SignalFx.
At SignalFx, we deal with high-volume high-resolution data from our users. This requires a high performance ingest pipeline. Over time we’ve found that we needed to adapt architectural principles from specialized fields such as HPC to get beyond performance plateaus encountered with more generic approaches. Some key examples include:
* Write very simple single threaded code, instead of complex algorithms
* Parallelize by running multiple copies of simple single threaded code, instead of using concurrent algorithms
* Separate the data plane from the control plane, instead of slowing the data path for control traffic
* Write compact, array-based data structures with minimal indirection, instead of pointer-based data structures and uncontrolled allocation
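The last of these principles lends itself to a small illustration. The sketch below (Python, purely illustrative; it is not SignalFx's code) shows a compact struct-of-arrays layout: one primitive array per field instead of one heap-allocated object per point.

```python
from array import array

# Hypothetical struct-of-arrays layout for time series points. Instead
# of a list of Point objects (one allocation each, reached via pointer
# chasing), each field lives in its own contiguous primitive array.
class PointColumns:
    def __init__(self):
        self.timestamps = array("q")  # int64 epoch millis
        self.values = array("d")      # float64 measurements

    def append(self, ts, value):
        self.timestamps.append(ts)
        self.values.append(value)

    def mean(self):
        # A sequential scan over a contiguous buffer is cache-friendly.
        return sum(self.values) / len(self.values)

cols = PointColumns()
for i in range(4):
    cols.append(1_700_000_000_000 + i, float(i))
print(cols.mean())  # 1.5
```

The same idea applies with far more impact in languages with value types, where the layout also eliminates per-element headers and garbage-collection pressure.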
SignalFx: Making Cassandra Perform as a Time Series Database - DataStax Academy
SignalFx ingests, processes, runs analytics against, and ultimately stores massive numbers of time series streaming in parallel into our service, which provides an analytics-based monitoring platform for modern applications.
We chose to build our time series database (TSDB) on Cassandra for its read and write performance at high load. This presentation will go over our evolution of optimizations to squeeze the most performance out of the TSDB to date and some steps we'll be taking in the future.
Spark Streaming has supported Kafka since its inception, but a lot has changed since then, on both the Spark and Kafka sides, to make this integration more fault-tolerant and reliable. Apache Kafka 0.10 (actually since 0.9) introduced the new Consumer API, built on top of a new group coordination protocol provided by Kafka itself.
So a new Spark Streaming integration comes to the playground, with a design similar to the 0.8 Direct DStream approach. However, there are notable differences in usage, and many exciting new features. In this talk, we will cover the main differences between this new integration and the previous one (for Kafka 0.8), and why Direct DStreams have replaced Receivers for good. We will also see how to achieve different semantics (at-least-once, at-most-once, exactly-once) with code examples.
Finally, we will briefly introduce the usage of this integration in Billy Mobile to ingest and process the continuous stream of events from our AdNetwork.
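As a rough illustration of the delivery semantics mentioned above (a toy model in plain Python, not the Spark or Kafka API): whether the consumer commits its offset before or after processing a record determines whether a crash loses the record or replays it.

```python
# Illustrative stand-in for a partition log plus an offset store.
# crash_at simulates one crash/restart at the given loop iteration.
log = ["a", "b", "c"]

def run(commit_before_processing, crash_at):
    committed, out, attempt = 0, [], 0
    while committed < len(log):
        offset = committed
        if commit_before_processing:        # at-most-once ordering
            committed = offset + 1
        if attempt == crash_at:             # simulated crash, then restart
            attempt += 1
            continue
        out.append(log[offset])             # the side-effecting "work"
        if not commit_before_processing:    # at-least-once ordering
            committed = offset + 1
        attempt += 1
    return out

print(run(commit_before_processing=True, crash_at=1))   # ['a', 'c'] - "b" is lost
print(run(commit_before_processing=False, crash_at=1))  # ['a', 'b', 'c'] - "b" retried
```

Exactly-once then requires making the processing side effect and the offset commit atomic, which is what the transactional additions to Kafka provide.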
Producer Performance Tuning for Apache Kafka - Jiangjie Qin
Kafka is well known for high throughput ingestion. However, to get the best latency characteristics without compromising on throughput and durability, we need to tune Kafka. In this talk, we share our experiences to achieve the optimal combination of latency, throughput and durability for different scenarios.
What's the time? ...and why? (Matthias Sax, Confluent) Kafka Summit SF 2019 - confluent
Data stream processing is built on the core concept of time. However, understanding time semantics and reasoning about time is not simple, especially if deterministic processing is expected. In this talk, we explain the difference between processing, ingestion, and event time and what their impact is on data stream processing. Furthermore, we explain how Kafka clusters and stream processing applications must be configured to achieve specific time semantics. Finally, we deep dive into the time semantics of the Kafka Streams DSL and KSQL operators, and explain in detail how the runtime handles time. Apache Kafka offers many ways to handle time on the storage layer, i.e., the brokers, allowing users to build applications with different semantics. Time semantics in the processing layer, i.e., Kafka Streams and KSQL, are even richer and more powerful, but also more complicated. Hence, it is paramount for developers to understand the different time semantics and to know how to configure Kafka to achieve them. Therefore, this talk enables developers to design applications with their desired time semantics, helps them reason about the runtime behavior with regard to time, and allows them to understand processing/query results.
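A minimal illustration of the event-time vs processing-time distinction (plain Python, not the Kafka Streams API): the same records, one arriving late, land in different windows depending on which notion of time drives the bucketing.

```python
from collections import defaultdict

WINDOW_MS = 1000

# (event_time_ms, value); the record stamped 500 ms arrives last.
records = [(100, 1), (1200, 2), (500, 3)]

def window_by_event_time(recs):
    windows = defaultdict(list)
    for event_ts, value in recs:
        windows[event_ts // WINDOW_MS].append(value)
    return dict(windows)

def window_by_arrival_time(recs, arrival_ts):
    windows = defaultdict(list)
    for (_, value), arrived in zip(recs, arrival_ts):
        windows[arrived // WINDOW_MS].append(value)
    return dict(windows)

# Event time puts the late record back into window 0, deterministically.
print(window_by_event_time(records))                # {0: [1, 3], 1: [2]}
# Processing time depends on when records happen to show up.
print(window_by_arrival_time(records, [0, 1, 2]))   # {0: [1, 2, 3]}
```

The event-time result is reproducible across replays; the processing-time result changes with arrival timing, which is exactly why deterministic pipelines prefer event time.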
Achieving a 50% Reduction in Cross-AZ Network Costs from Kafka (Uday Sagar Si... - confluent
Cloud providers like AWS allow free data transfers within an Availability Zone (AZ), but bill users when data moves between AZs. When the data volume streamed through Kafka reaches big data scale (e.g. numeric data points or user activity tracking), the costs incurred by cross-AZ traffic can add significantly to your monthly cloud spend. Since Kafka serves reads and writes only from leader partitions, for a topic with a replication factor of 3, a message sent through Kafka can cross AZs up to 4 times: once when a producer produces a message onto a broker in a different AZ, twice during Kafka replication, and once more during message consumption. With careful design, we can eliminate the first and last parts of the cross-AZ traffic. We can also use message compression strategies provided by Kafka to reduce costs during replication. In this talk, we will discuss the architectural choices that allow us to ensure a Kafka message is produced and consumed within a single AZ, as well as an algorithm that lets consumers intelligently subscribe to partitions with leaders in the same AZ. We will also cover use cases in which cross-AZ message streaming is unavoidable due to design limitations. Talk outline: 1) A review of Kafka replication, 2) Cross-AZ traffic implications, 3) Architectural choices for AZ-aware message streaming, 4) Algorithms for AZ-aware producers and consumers, 5) Results, 6) Limitations, 7) Takeaways.
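The leader-local subscription idea can be sketched as follows. The helper below is hypothetical (Kafka's built-in mechanism for AZ-aware reads is follower fetching via `broker.rack`/`client.rack`, KIP-392); it just shows the core selection logic: prefer partitions whose leader broker sits in the consumer's own AZ.

```python
# Hypothetical AZ-aware partition selection, for illustration only.
def partitions_for_consumer(partition_leaders, broker_az, consumer_az):
    """partition_leaders: {partition_id: leader_broker_id},
    broker_az: {broker_id: az_name}."""
    local = [p for p, b in sorted(partition_leaders.items())
             if broker_az[b] == consumer_az]
    # Fall back to all partitions if no leader is in our AZ.
    return local or sorted(partition_leaders)

leaders = {0: "b1", 1: "b2", 2: "b3", 3: "b1"}
azs = {"b1": "us-east-1a", "b2": "us-east-1b", "b3": "us-east-1c"}
print(partitions_for_consumer(leaders, azs, "us-east-1a"))  # [0, 3]
```

A real implementation also has to handle rebalancing so that every partition is still covered when an AZ has no consumers, which is one of the design limitations the talk alludes to.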
Strategies and techniques to optimize Kafka brokers and producers to minimize data loss under huge traffic volume, limited configuration options, and a less-than-ideal, constantly changing environment, while balancing against cost.
In the age of NoSQL, big data storage engines such as HBase have given up ACID semantics of traditional relational databases, in exchange for high scalability and availability. However, it turns out that in practice, many applications require consistency guarantees to protect data from concurrent modification in a massively parallel environment. In the past few years, several transaction engines have been proposed as add-ons to HBase; three different engines, namely Omid, Tephra, and Trafodion were open-sourced in Apache alone. In this talk, we will introduce and compare the different approaches from various perspectives including scalability, efficiency, operability and portability, and make recommendations pertaining to different use cases.
HBaseCon2017: Improving HBase availability in a multi-tenant environment - HBaseCon
Infrastructure failures are a given in the cloud, but in a multi-tenant environment separating those failures from usage can be a challenge. I'll be presenting data gathered from over a hundred region server failures at HubSpot along with what we've done to improve our MTTR and what we're contributing back to the community. Covered topics will include separating usage-related failures from infrastructure and hardware failures, as well as steps we've taken to improve MTTR in both scenarios.
HBaseCon 2015: OpenTSDB and AsyncHBase Update - HBaseCon
OpenTSDB continues to scale along with HBase. A number of updates have been implemented to push writes over 2 million data points a second. Here we will discuss HBase schema improvements, including salting, random UID assignment, and using append operations instead of puts. You'll also get AsyncHBase development updates about rate limiting, statistics, and security.
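The salting technique mentioned above can be sketched like this (illustrative Python; the bucket count and string key layout are simplified, not OpenTSDB's actual byte-level schema). A salt prefix derived from the metric spreads writes across the key space instead of concentrating a hot metric in one lexicographic range on one region server.

```python
import hashlib

SALT_BUCKETS = 4  # illustrative; real deployments size this to the cluster

def salted_row_key(metric, base_ts):
    # Derive a stable salt from the metric so reads can reconstruct it.
    digest = hashlib.md5(metric.encode()).digest()
    salt = digest[0] % SALT_BUCKETS
    return f"{salt:02d}|{metric}|{base_ts}"

keys = [salted_row_key(m, 1500000000)
        for m in ("sys.cpu", "sys.mem", "net.rx")]
print(sorted(keys))
```

The cost is on the read side: a query for one metric must now fan out to all `SALT_BUCKETS` prefixes and merge the results.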
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a... - Flink Forward
Let’s be honest: Running a distributed stateful stream processor that is able to handle terabytes of state and tens of gigabytes of data per second while being highly available and correct (in an exactly-once sense) does not work without any planning, configuration, and monitoring. While the Flink developer community tries to make everything as simple as possible, it is still important to be aware of all the requirements and implications. In this talk, we will provide some insights into the greatest operations mysteries of Flink from a high-level perspective:
- Capacity and resource planning: Understand the theoretical limits.
- Memory and CPU configuration: Distribute resources according to your needs.
- Setting up High Availability: Planning for failures.
- Checkpointing and State Backends: Ensure correctness and fast recovery.
For each of the listed topics, we will introduce the concepts of Flink and provide some best practices we have learned over the past years supporting Flink users in production.
Kafka on ZFS: Better Living Through Filesystems - confluent
(Hugh O'Brien, Jet.com) Kafka Summit SF 2018
You’re doing disk IO wrong; let ZFS show you the way. ZFS on Linux is now stable. Say goodbye to JBOD, to directories in your reassignment plans, to unevenly used disks. Instead, have 8K Cloud IOPS for $25, SSD-speed reads on spinning disks, in-kernel LZ4 compression, and the smartest page cache on the planet. (Fear compactions no more!)
Learn how Jet’s Kafka clusters squeeze every drop of disk performance out of Azure, all completely transparent to Kafka.
-Striping cheap disks to maximize instance IOPS
-Block compression to reduce disk usage by ~80% (JSON data)
-Instance SSD as the secondary read cache (storing compressed data), eliminating >99% of disk reads and safe across host redeployments
-Upcoming features: Compressed blocks in memory, potentially quadrupling your page cache (RAM) for free
We’ll cover:
-Basic Principles
-Adapting ZFS for cloud instances (gotchas)
-Performance tuning for Kafka
-Benchmarks
Flink Forward SF 2017: Cliff Resnick & Seth Wiesman - From Zero to Streami... - Flink Forward
Apache Flink provides powerful stream processing capabilities which can allow organizations to move directly from batch to real time analytics, skipping the lambda architecture entirely. However, getting to production is not always as simple as rewriting your job in a new API, but requires rethinking your application design with a stream first mindset. This talk will cover MediaMath’s journey in rebuilding its reporting infrastructure using Apache Flink. We will discuss high level architectural designs when building an extensible reporting platform as well as deep dive into specific technical hurdles. Topics will include managing a Flink cluster on EC2 spot instances, reconciling Flink’s consistency model with S3’s, handling massive data skew as well as tools and techniques for building performant, fault tolerant streaming applications.
Introducing Exactly Once Semantics in Apache Kafka with Matthias J. Sax - Databricks
Apache Kafka’s rise in popularity as a streaming platform has demanded a revisit of its traditional at least once message delivery semantics. In this talk, we present the recent additions to Apache Kafka to achieve exactly once semantics. We shall discuss the newly introduced transactional APIs and use Kafka Streams as an example to show how these APIs are leveraged for streams tasks.
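A toy model of the idea behind exactly once (this is not the actual Kafka transactional API, just the principle it implements): commit the processing result and the consumed offset atomically, so a replay after a failure cannot double-apply a record.

```python
# Illustrative only: output and offset live in one store and change
# together, mimicking what Kafka's transactions achieve across the
# output topic and the consumer offsets topic.
class AtomicStore:
    def __init__(self):
        self.total = 0    # materialized output
        self.offset = 0   # consumed position, part of the same "transaction"

    def commit(self, new_total, new_offset):
        # Both fields change together or not at all.
        self.total, self.offset = new_total, new_offset

def process(store, log):
    # Always resume from the committed offset.
    for off in range(store.offset, len(log)):
        store.commit(store.total + log[off], off + 1)

store, log = AtomicStore(), [1, 2, 3]
process(store, log)
process(store, log)   # a "replay" from the committed offset is a no-op
print(store.total)    # 6, not 12
```

If the offset were stored separately and committed at a different time, the second run could re-add records that were already counted, which is precisely the duplicate anomaly the transactional APIs close off.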
High Performance Erlang - Pitfalls and Solutions - Yinghai Lu
Presented at Erlang Factory 2016, San Francisco, CA.
Erlang is widely used for building concurrent applications. However, when we push the performance of our Erlang-based application to handle millions of concurrent clients, some Erlang scalability issues begin to show, and some conventional programming paradigms of Erlang no longer hold. We would like to share some of these issues and how we address them. In addition, we share some of our experience on how to profile an Erlang application to identify bottlenecks.
We will take a deep look at some of the basic mechanisms of Erlang and show how they behave under high load and parallelism, which includes message delivery, process management and shared data structures such as maps and ETS tables. We will demonstrate their limitations and propose techniques to alleviate the issues.
We will also share profiling techniques on how to find those bottlenecks in Erlang applications across different levels. We will share techniques for writing highly performant Erlang applications.
Exactly-Once Made Easy: Transactional Messaging Improvement for Usability and... - Guozhang Wang
Since its original release, exactly-once semantics (EOS) processing has received wide adoption as a much-needed feature in the community, and has also exposed various scalability and usability issues when applied in production systems.
To address those issues, we improved on the existing EOS model by integrating static Producer transaction semantics with dynamic Consumer group semantics. We will have a deep dive into the newly added features (KIP-447), from which the audience will gain more insight into the scalability vs. semantics-guarantee tradeoffs and how Kafka Streams specifically leveraged them to help scale EOS streaming applications written in this library.
Jay Kreps is a Principal Staff Engineer at LinkedIn where he is the lead architect for online data infrastructure. He is among the original authors of several open source projects including a distributed key-value store called Project Voldemort, a messaging system called Kafka, and a stream processing system called Samza. This talk gives an introduction to Apache Kafka, a distributed messaging system. It will cover both how Kafka works, as well as how it is used at LinkedIn for log aggregation, messaging, ETL, and real-time stream processing.
AWS Loft Talk: Behind the Scenes with SignalFx - SignalFx
Slides from SignalFx CTO Phillip Liu's presentation at the AWS Loft in SF after DockerCon: Behind the Scenes with SignalFx.
Phil discussed how SignalFx deploys, runs, and operates a completely Dockerized microservices architecture for a production SaaS application dealing with large volumes of high resolution customer data.
Making Cassandra Perform as a Time Series Database - Cassandra Summit 15 - SignalFx
SignalFx engineer Paul Ingram presented these slides at Cassandra Summit 2015.
SignalFx ingests, processes, runs analytics against, and ultimately stores massive numbers of time series streaming in parallel into our service, which provides an analytics-based monitoring platform for modern applications.
We chose to build our time series database (TSDB) on Cassandra for its read and write performance at high load. This presentation will go over our evolution of optimizations to squeeze the most performance out of the TSDB to date and some steps we'll be taking in the future.
Read more: http://blog.signalfx.com/making-cassandra-perform-as-a-tsdb
SignalFx Elasticsearch Metrics Monitoring and Alerting - SignalFx
From our Feb 25, 2016 webcast on operating Elasticsearch at scale, the metrics to monitor, and how to create low-noise meaningful alerts on Elasticsearch performance.
Maxime Petazzoni, Software Engineer at SignalFx, presents how we use Docker and how we monitor containers in production.
SignalFx has been using Docker since November 2013. We have been running Docker in prod ever since we’ve had a “prod”, back when Docker’s README said “DO NOT RUN IN PRODUCTION”.
Microservices and Devs in Charge: Why Monitoring is an Analytics Problem - SignalFx
Presented at GlueCon 2015.
This presentation discusses SignalFx CTO and co-founder Phillip Liu's experience operating infrastructure and apps at massive scale and what drove the realization that monitoring is fundamentally an analytics problem now. Following on the heels of Adrian Cockroft's keynote that morning, Monitoring Microservices and Containers, this presentation went over real-world examples of how modern monitoring for microservices works.
Operationalizing Docker at Scale: Lessons from Running Microservices in Produ... - SignalFx
Zenefits principal engineer Venkat Thiruvengadam and SignalFx engineer Maxime Petazzoni discuss operationalizing Docker at scale. Learn about the transition to a well-conceived microservices approach, the tools chosen to support these services, and the lessons learned from monitoring containers in production in a high-performance environment.
Apache Kafka lies at the heart of the largest data pipelines, handling trillions of messages and petabytes of data every day. Learn the right approach for getting the most out of Kafka from the experts at LinkedIn and Confluent. Todd Palino and Gwen Shapira demonstrate how to monitor, optimize, and troubleshoot performance of your data pipelines—from producer to consumer, development to production—as they explore some of the common problems that Kafka developers and administrators encounter when they take Apache Kafka from a proof of concept to production usage. Too often, systems are overprovisioned and underutilized and still have trouble meeting reasonable performance agreements.
Topics include:
- What latencies and throughputs you should expect from Kafka
- How to select hardware and size components
- What you should be monitoring
- Design patterns and antipatterns for client applications
- How to go about diagnosing performance bottlenecks
- Which configurations to examine and which ones to avoid
This presentation was given at the ApacheCon 2015 Kafka Meetup.
These slides go into some detail on how to tune and scale Kafka clusters and the components involved. The slides themselves are bullet points, and all the detail is in the slide notes, so please download the original presentation and review those.
Go debugging and troubleshooting tips - from real life lessons at SignalFxSignalFx
Exploring tips and advice on writing production Go systems that are easy to debug and troubleshoot. Jack Lindamood from SignalFx presents patterns that facilitate this process.
Jack addresses tools built into Go you can take advantage of, build process techniques they've learned over time, and open source tools and libraries you can use that help troubleshoot your production code when things go wrong.
Read more here: http://blog.signalfx.com/a-pattern-for-optimizing-go
Kafka at Scale: Multi-Tier ArchitecturesTodd Palino
This is a talk given at ApacheCon 2015
If data is the lifeblood of high technology, Apache Kafka is the circulatory system in use at LinkedIn. It is used for moving every type of data around between systems, and it touches virtually every server, every day. This can only be accomplished with multiple Kafka clusters, installed at several sites, and they must all work together to assure no message loss, and almost no message duplication. In this presentation, we will discuss the architectural choices behind how the clusters are deployed, and the tools and processes that have been developed to manage them. Todd Palino will also discuss some of the challenges of running Kafka at this scale, and how they are being addressed both operationally and in the Kafka development community.
Note - there are a significant amount of slide notes on each slide that goes into detail. Please make sure to check out the downloaded file to get the full content!
Storing time series data with Apache CassandraPatrick McFadin
If you are looking to collect and store time series data, it's probably not going to be small. Don't get caught without a plan! Apache Cassandra has proven itself as a solid choice now you can learn how to do it. We'll look at possible data models and the the choices you have to be successful. Then, let's open the hood and learn about how data is stored in Apache Cassandra. You don't need to be an expert in distributed systems to make this work and I'll show you how. I'll give you real-world examples and work through the steps. Give me an hour and I will upgrade your time series game.
Kafka Streams is a new stream processing library natively integrated with Kafka. It has a very low barrier to entry, easy operationalization, and a natural DSL for writing stream processing applications. As such it is the most convenient yet scalable option to analyze, transform, or otherwise process data that is backed by Kafka. We will provide the audience with an overview of Kafka Streams including its design and API, typical use cases, code examples, and an outlook of its upcoming roadmap. We will also compare Kafka Streams' light-weight library approach with heavier, framework-based tools such as Spark Streaming or Storm, which require you to understand and operate a whole different infrastructure for processing real-time data in Kafka.
This session will go into best practices and detail on how to architect a near real-time application on Hadoop using an end-to-end fraud detection case study as an example. It will discuss various options available for ingest, schema design, processing frameworks, storage handlers and others, available for architecting this fraud detection application and walk through each of the architectural decisions among those choices.
Using Docker for GPU Accelerated ApplicationsNVIDIA
Build and run Docker containers leveraging NVIDIA GPUs. Containerizing GPU applications provides several benefits, among them:
* Reproducible builds
* Ease of deployment
* Isolation of individual devices
* Run across heterogeneous driver/toolkit environments
* Requires only the NVIDIA driver to be installed
* Enables "fire and forget" GPU applications
* Facilitate collaboration
Purpose of the session is to have a dive into Apache, Kafka, Data Streaming and Kafka in the cloud
- Dive into Apache Kafka
- Data Streaming
- Kafka in the cloud
10 Lessons Learned from using Kafka in 1000 microservices - ScalaUANatan Silnitsky
Kafka is the bedrock of Wix’s distributed Mega Microservices system.
Over the years we have learned a lot about how to successfully scale our event-driven architecture to roughly 1400 mostly Scala microservices.
In this talk, you will learn about 10 key decisions and steps you can take in order to safely scale-up your Kafka-based system.
These Include:
* How to increase dev velocity of event-driven style code.
* How to optimize working with Kafka in polyglot setting
* How to migrate from request-reply to event-driven
* How to tackle multiple DCs environment.
This presentation is from the Gophercon-India where we talked about how to design a concurrent high performance database client in go language. We talked about how we use goroutines and channels to our advantages. we also talked about how to use pools for efficient memory utilization.
In this presentation we consider how to resolve Firebird performance problems: what Firebird database parameters we need to monitor and how we need to tune Firebird configuration and adjust client applications.
Speakers: Liang Xie and Honghua Feng (Xiamoi)
This talk covers the HBase environment at Xiaomi, including thoughts and practices around latency, hardware/OS/VM configuration, GC tuning, the use of a new write thread model and reverse scan, and block index optimization. It will also include some discussion of planned JIRAs based on these approaches.
(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014Amazon Web Services
Tuning your EC2 web server will help you to improve application server throughput and cost-efficiency as well as reduce request latency. In this session we will walk through tactics to identify bottlenecks using tools such as CloudWatch in order to drive the appropriate allocation of EC2 and EBS resources. In addition, we will also be reviewing some performance optimizations and best practices for popular web servers such as Nginx and Apache in order to take advantage of the latest EC2 capabilities.
When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...confluent
In the financial industry, losing data is unacceptable. Financial firms are adopting Kafka for their critical applications. Kafka provides the low latency, high throughput, high availability, and scale that these applications require. But can it also provide complete reliability? As a system architect, when asked “Can you guarantee that we will always get every transaction,” you want to be able to say “Yes” with total confidence.
In this session, we will go over everything that happens to a message – from producer to consumer, and pinpoint all the places where data can be lost – if you are not careful. You will learn how developers and operation teams can work together to build a bulletproof data pipeline with Kafka. And if you need proof that you built a reliable system – we’ll show you how you can build the system to prove this too.
Presented at LISA18: https://www.usenix.org/conference/lisa18/presentation/babrou
This is a technical dive into how we used eBPF to solve real-world issues uncovered during an innocent OS upgrade. We'll see how we debugged 10x CPU increase in Kafka after Debian upgrade and what lessons we learned. We'll get from high-level effects like increased CPU to flamegraphs showing us where the problem lies to tracing timers and functions calls in the Linux kernel.
The focus is on tools what operational engineers can use to debug performance issues in production. This particular issue happened at Cloudflare on a Kafka cluster doing 100Gbps of ingress and many multiple of that egress.
The post release technologies of Crysis 3 (Slides Only) - Stewart NeedhamStewart Needham
For AAA games now there is a consumer expectation that the developer has a post release strategy. This strategy goes beyond just DLC content. Users expect to receive bug fixes, balancing updates, gamemode variations and constant tuning of the game experience. So how can you architect your game technology to facilitate all of this? Stewart explains the unique patching system developed for Crysis 3 Multiplayer which allowed the team to hot-patch pretty much any asset or data used by the game. He also details the supporting telemetry, server and testing infrastructure required to support this along with some interesting lessons learned.
Kernel Recipes 2014 - NDIV: a low overhead network traffic diverterAnne Nicolas
NDIV is a young, very simple, yet efficient network traffic diverter. Its purpose is to help build network applications that intercept packets at line rate with a very low processing overhead. A first example application is a stateless HTTP server reaching line rate on all packet sizes.
Willy Tarreau, HaproxyTech
JDD2015: Make your world event driven - Krzysztof DębskiPROIDEA
MAKE YOUR WORLD EVENT DRIVEN
Just after you set up your first microservice you realize that the game has just started. You need to improve latency in your application and reduce unnecessary communication.
To make your architecture fully decoupled you need to embrace asynchronous communication. Good way to achieve that is to switch to Event Driven Architecture.
We will see how to use Kafka in your microservices. We will also cover some pitfalls you might face during using Kafka and how to deal with them.
After the talk you will know the toolset that are need to improve your microservice ecosystem.
5. SignalFx is built for monitoring modern infrastructure
• High resolution:
• Any mix of resolutions up to 1 sec
• Streaming analytics:
• Custom analytics pipelines at any scale that output in seconds
• Streaming dashboards update in seconds
• Multidimensional metrics:
• Dimensions allow arbitrary modeling, pivoting, filtering, and grouping of both raw and derived (from analytics) metrics interactively on streaming data
• E.g. 99th-percentile-of-latency-by-service-by-customer
6. Why write a new Kafka consumer
• Designed to replace SimpleConsumer, not the 0.9 consumer
• Needed a non-blocking, single-threaded consumer
• Wanted it to be low overhead
• 100s of thousands of messages/second
• Sensitive to GC
• The Kafka 0.9 consumer wasn’t ready yet
17. Cache Lines
• Data is transferred between memory and cache in blocks of fixed size, called cache lines (typically 64 bytes)
• The memory subsystem makes a few bets to help us:
• Temporal locality
• Spatial locality
• Prefetching
25. Optimization aims
• We are NOT aiming for more data/second
• Even a very inefficient implementation will be bottlenecked by the network
• We are aiming to make the client get out of the way
• The client is not the only thing running on the system
• Leave all resources for the actual application
26. Efficiency vs. raw speed
• We value efficiency more than raw speed for the client
• Fewer cycles
• Less cache usage and fewer cache misses
• Less memory?
• Efficiency for the client == raw speed for the application
27. Efficiency from constraints
• No consumer group functionality needed
• A single topic
• Finite number of integer partitions
• Partition reassignment is rare and happens during startup and shutdown
• We are in control of the code that consumes the messages
29. Use arrays and open addressing hash maps
• Single topic. Fewer than 1024 partitions
• Instead of maps we can use arrays
• Or use primitive-specialized open addressing hash maps
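The array-instead-of-map idea above can be sketched as follows. This is an illustrative sketch, not code from the actual SignalFx client: the class name, the 1024-partition bound, and the -1 sentinel are assumptions, but they follow the constraints the slides describe (single topic, partition id doubles as the array index).

```java
import java.util.Arrays;

// Sketch: per-partition state in a plain long[] instead of a Map<Integer, Long>.
// Assumes a single topic with a known upper bound on partition count.
public class PartitionOffsets {
    static final int MAX_PARTITIONS = 1024;
    static final long UNASSIGNED = -1L;   // sentinel for partitions we don't own

    final long[] offsets = new long[MAX_PARTITIONS];

    public PartitionOffsets() {
        Arrays.fill(offsets, UNASSIGNED);
    }

    // O(1), no hashing, no boxing: the partition id *is* the array index.
    public void setOffset(int partition, long offset) { offsets[partition] = offset; }
    public long getOffset(int partition)              { return offsets[partition]; }
    public boolean isAssigned(int partition)          { return offsets[partition] != UNASSIGNED; }

    public static void main(String[] args) {
        PartitionOffsets po = new PartitionOffsets();
        po.setOffset(0, 116L);
        System.out.println(po.getOffset(0));   // 116
        System.out.println(po.isAssigned(5));  // false
    }
}
```

A lookup touches exactly one array slot, with no Integer/Long boxing and no Map.Entry indirection; the trade-off against a HashMap is worked through in the notes further down.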
35. Low memory and cache friendly data structures
• Queues built from integer arrays. Negative -> partition lost
• Zero-allocation hashed-wheel timer to close stuck connections
• Open addressing hash maps
• BitSets coded on top of long arrays whenever a set of partitions is required
• Can be traversed in O(num set bits)
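The "BitSet on long arrays, traversed in O(num set bits)" bullet can be sketched like this. Names are hypothetical; the lowest-set-bit trick is the same one that appears in the slide 43 code below.

```java
import java.util.function.IntConsumer;

// Sketch: a partition set stored as raw longs, visited in O(number of set
// bits) rather than O(capacity), using the lowest-set-bit idiom.
public class PartitionBitSet {
    final long[] words;

    public PartitionBitSet(int maxPartitions) {
        words = new long[(maxPartitions + 63) / 64];
    }

    public void add(int partition) {
        words[partition >> 6] |= 1L << (partition & 63);
    }

    // Visit each member without scanning empty bit positions.
    public void forEach(IntConsumer consumer) {
        for (int i = 0; i < words.length; i++) {
            long bits = words[i];
            while (bits != 0) {
                long lowest = bits & -bits;  // isolate the lowest set bit
                consumer.accept(i * 64 + Long.numberOfTrailingZeros(lowest));
                bits ^= lowest;              // clear it and continue
            }
        }
    }

    public static void main(String[] args) {
        PartitionBitSet set = new PartitionBitSet(1024);
        set.add(3); set.add(64); set.add(700);
        set.forEach(p -> System.out.println(p)); // prints 3, 64, 700
    }
}
```

Sixteen longs cover 1024 partitions (128 bytes, two cache lines), and traversal cost tracks membership, not capacity.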
36. Applicability and benefit to Kafka consumer 0.9
• Benefits - medium
• Lots of hash map lookups
• Applicability - low
• Multiple topics - sparse arrays not a great match
• Open addressing hash maps - preserve most of the benefits
38. Eliminate redundant work
• A single topic. Finite number of partitions:
• Topic and client string immutable
• The metadata request buffer can be created just once and kept around forever
• Other requests can have their fixed part written out and only write the variable part on each request
• Offset request = fixed_part + per_partition_part
• Fetch request = fixed_part + per_partition_part
43. Code
private void setNewOffsetsForFetchRequest() {
final ByteBuffer buffer = this.fetchRequestBuffer;
// Iterate through the partitions assigned to this broker
// and write the offset directly on the buffer.
for (int i = 0; i < partitionAssignment.length; i++) {
// This loop runs in O(partitions assigned).
long bitSet = partitionAssignment[i];
while (bitSet != 0) {
final long t = bitSet & -bitSet;
final int partitionId = i * 64 + Long.bitCount(t - 1);
// The position in the buffer that points to the
// beginning of the offset for this partition.
final int bufferPositionForOffset = fetchRequestIndex[partitionId];
final long offset = partitionToOffset[partitionId];
// Write the offset directly.
buffer.putLong(bufferPositionForOffset, offset);
bitSet ^= t;
}
}
}
46. Applicability and benefit to Kafka consumer 0.9
• Benefits - high
• Reuse instead of allocating - temporal locality
• Streaming through 3 arrays - prefetching
• One fetch request per fetch response - common
• Metadata or offset requests - rare
• Applicability - high
• Internal detail so API doesn’t change
• Even for consumer groups, partition reassignment and partition migration events are rare
48. Stream responses to application
• Pass each message to the application when it is ready
• Consume messages synchronously without a copy or allocation
• No deserialization required
• Benefits add up when processing 100s of thousands of messages per second
49. Low level interface
public interface KafkaMessageHandler {
void handleMessage(ByteBuffer buffer, int position, int length);
}
public interface KafkaConsumer {
void poll(KafkaMessageHandler handler, long timeoutMs);
. . .
. . .
}
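A hedged sketch of how an application might implement this handler interface, parsing fields straight out of the shared buffer with absolute gets so nothing is copied or allocated per message. The message layout here (one big-endian long per message) is invented purely for illustration; the interface itself matches the slide.

```java
import java.nio.ByteBuffer;

public class HandlerDemo {
    // The low-level interface as shown on the slide.
    public interface KafkaMessageHandler {
        void handleMessage(ByteBuffer buffer, int position, int length);
    }

    // Hypothetical handler: sums a long out of each message in place.
    static class SummingHandler implements KafkaMessageHandler {
        long sum;

        @Override
        public void handleMessage(ByteBuffer buffer, int position, int length) {
            // Absolute get: no slice(), no byte[] copy, no deserialized
            // object, and the buffer's position/limit are untouched.
            sum += buffer.getLong(position);
        }
    }

    public static void main(String[] args) {
        // Simulate a response buffer holding three 8-byte "messages".
        ByteBuffer buf = ByteBuffer.allocate(24);
        buf.putLong(0, 10L).putLong(8, 20L).putLong(16, 12L);

        SummingHandler handler = new SummingHandler();
        for (int pos = 0; pos < 24; pos += 8) {
            handler.handleMessage(buf, pos, 8);
        }
        System.out.println(handler.sum); // 42
    }
}
```

Because the handler reads directly out of the consumer's response buffer, a buggy handler could also corrupt or mis-read it — which is exactly the "integrity of internal buffers" caveat raised on slide 54.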
54. Applicability and benefit to Kafka consumer 0.9
• Benefits - very high
• Reuse response buffer, no allocations - temporal locality
• Data is processed right after being read from the socket - temporal locality
• Streaming through a buffer - spatial locality + prefetching
• Combine with DirectByteBuffers for zero copy
• Applicability - low
• API too low level
• Integrity of internal buffers compromised by bugs in application
• Maybe a low level “with great power comes great responsibility” API
56. Caveats
• These are from running a very specific workload similar to our application
• There are many Pareto-optimal choices for a client. Ours is not better in any way - it’s just tuned for our workload
• It can and will prove bad for other workloads
57. Benchmark
• Single topic-partition
• Settings of fetch_max_wait, fetch_min_bytes, max_bytes_per_partition were identical
• Only 5000 messages per second produced by a single producer
• Each message is 23 bytes
• Warm up -> profile for 5 mins
• 5000/sec * 5 mins = 1.5 million messages
• Profiler = Java Mission Control
A Kafka cluster has multiple brokers. Each broker is a process of its own with a unique id.
The unit of serializability in Kafka is a partition. Each partition has all its messages ordered.
I like to think of a topic as a group of partitions.
A partition has a statically assigned leader. From the POV of regular clients all read/write operations must go through the leader.
So a client needs to know the mapping of topic-partitions to brokers. This mapping can change dynamically.
A client begins by sending a metadata request to know this mapping. A metadata request can be sent to any broker in the cluster.
The broker then replies with a metadata response.
So the client can now form a map of partitions to brokers.
Next the client needs to build a table of partition -> next offset to consume. It can get it from the consumer group functionality or some other external source.
Once this is built it can send fetch requests for actual data.
As long as there is actual data to consume and no errors it gets back a fetch response.
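The client state built up by this metadata-then-fetch flow can be sketched with plain arrays, since there is a single topic: one slot per partition for the owning broker and one for the next offset to consume. All names and the simplified "response" handling are invented for illustration; the real client obviously parses wire-format responses.

```java
import java.util.Arrays;

// Sketch of the per-partition state the notes describe: a partition ->
// broker map (from metadata responses) and a partition -> next-offset
// table (from an external offset source, then advanced by fetches).
public class ConsumerState {
    static final int NO_BROKER = -1;

    final int[] partitionToBroker;
    final long[] partitionToOffset;

    ConsumerState(int partitions) {
        partitionToBroker = new int[partitions];
        partitionToOffset = new long[partitions];
        Arrays.fill(partitionToBroker, NO_BROKER);
    }

    // Apply one (simplified) metadata response entry: partition -> broker.
    void onMetadata(int partition, int broker) {
        partitionToBroker[partition] = broker;
    }

    // After a fetch response is consumed, record the next offset to ask for.
    void onFetched(int partition, long nextOffset) {
        partitionToOffset[partition] = nextOffset;
    }

    public static void main(String[] args) {
        ConsumerState s = new ConsumerState(4);
        s.onMetadata(0, 2);   // metadata says partition 0 lives on broker 2
        s.onFetched(0, 117L); // consumed up to offset 116; fetch 117 next
        System.out.println(s.partitionToBroker[0] + " " + s.partitionToOffset[0]);
    }
}
```

With this state in hand, the steady-state loop is just: group partitions by broker, send one fetch request per broker, and bump the offsets as responses stream in.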
Data is transferred between memory and cache in blocks of fixed size, called cache lines (typically 64 bytes). If you need a single byte, 63 others are coming in for the ride and paying the full tax. So you might as well use these bytes.
When the processor needs to read or write a location in main memory, it first checks for a corresponding entry in the cache. In the case of:
1. a cache hit, the processor immediately reads or writes the data in the cache line
2. a cache miss, the cache allocates a new entry and copies in data from main memory, then the request (read or write) is fulfilled from the contents of the cache
An application summing numbers in nodes of a linked list might take one cache miss per node.
Spatial locality and prefetching help a lot when summing an array on the other hand. The compiler is also able to write better vectorized code if your layout looks like this.
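The two layouts contrasted here can be sketched side by side. Both functions compute the same sum; the difference is purely in memory access: the list does one dependent load per node, each potentially on its own cache line, while the array streams through consecutive lines that the prefetcher can fetch ahead of time. The class and names are illustrative.

```java
// Sketch: pointer-chasing linked list vs. flat array, as discussed above.
public class SumLayouts {
    static final class Node {
        final long value;
        final Node next;
        Node(long value, Node next) { this.value = value; this.next = next; }
    }

    // One dependent load per node; each node may cost a cache miss.
    static long sumList(Node head) {
        long sum = 0;
        for (Node n = head; n != null; n = n.next) sum += n.value;
        return sum;
    }

    // Sequential access: spatial locality, prefetching, and vectorizable.
    static long sumArray(long[] values) {
        long sum = 0;
        for (long v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        long[] values = {1, 2, 3, 4, 5};
        Node head = null;
        for (int i = values.length - 1; i >= 0; i--) head = new Node(values[i], head);
        System.out.println(sumList(head));    // 15
        System.out.println(sumArray(values)); // 15
    }
}
```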
We really really care about cache usage and cache misses.
We don’t care about memory as much.
So efficiency for the client means more resources for the application which means a faster application.
Almost all our optimizations are based on constraints that come from our use of the consumer. So, many of them are not directly applicable to generic Kafka clients which need to work well under various scenarios.
We need no consumer group functionality. We manage partitions and offsets outside of Kafka. This makes our client super simple.
A single topic. Our applications mostly consume a topic.
We have a finite small number of partitions. Usually <= 1024.
Partition reassignment is rare. I would imagine that this is true for most applications.
Control of the entire pipeline means we can make some assumptions that a generic client cannot. End to end principle.
Now the interesting part.
Since we have a single topic, all partitions implicitly belong to that topic. So we don’t need a concept of topic-partition. We only have partitions. Since we don’t need topic-partition objects we can store all per partition data in arrays with the array index = partition number.
It is important to acknowledge that this is a tradeoff. Like we said before we really care about cache space and cache misses. We are ready to trade off using extra memory to reduce our cache usage. Here is an example:
Let’s imagine that we have a java.util.HashMap of partition to offset. We’ve already shown that we can incur multiple cache misses to do an offset get or put. Now let’s imagine that we have a single partition 0 with an offset of 116 stored in this map. How much memory does this use?
We’ll be generous and assume that headers are only 8 bytes and references are only 4 bytes.
So let’s assume that the entry array was preallocated for 2 entries. There is an 8 byte header, a 4 byte length and two 4 byte references. That’s 20 bytes. Similarly the actual entry is itself 16 bytes, the boxed long is 16 bytes and the boxed integer is 12 bytes. So in spite of all the references and indirection it only uses 64 bytes of memory.
On the other hand let’s assume that our sparse array has been preallocated for 1024 partitions. So it has an 8 byte header, a 4 byte length and 1024 8-byte entries, a total of 8204 bytes, which is around 8 KB. This is a lot more than 64 bytes and kind of wasteful.
Now let’s look at how much cache is used by each solution. Each cache line is 64 bytes. So even if you want a single byte 63 unrelated bytes might come along for the ride.
Now let’s look at the java hash map again. We first need to fetch the right entry - that’s one cache line. The other entry comes along for the ride and possibly the length and header. So that’s 64 bytes already.
Now the actual entry is on another cache line. That is another cache line used up.
Now we need to look at the contents of the boxed partition. That’s another random memory location so a new cache line.
Finally we fetch the offset itself and that’s another cache line.
So it’s 4 cache lines and hence 256 bytes of cache used up through a simple get request.
Now let’s look at the sparse offset array. We know where to fetch it from so with a single cache fetch we get the offset. It comes with potentially 7 other offsets none of which might be useful, but it’s still a single cache line. So we use only 64 bytes!
This example is a bit counter-intuitive. It goes to show that a data structure using only 64 bytes of memory can actually use many times that in cache, while a data structure using 8 KB of memory might only use a single cache line. This is a bit like virtual memory vs physical memory: you can use a lot of virtual memory but little physical memory and come out ahead. In our example physical memory is abundant (we have gigabytes of it). Cache memory is very limited - we only have around 32 KB of L1 cache, for example - so it’s much more precious than physical memory.
This also shows how we are ready to make trade-offs. Sparse arrays can take more memory, but they have a pretty much guaranteed worst-case cache usage and cache miss count.
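The sparse offset store from this example can be sketched in a few lines. This is a minimal illustration of the idea; the class and method names are made up, not the actual client’s code:

```java
import java.util.Arrays;

// Sparse, primitive-array offset store: one slot per partition,
// indexed directly by partition number. A get or put touches at
// most one cache line. -1 marks "no offset stored yet".
public class OffsetStore {
    private final long[] offsets;

    public OffsetStore(int maxPartitions) {
        // ~8 KB of memory for 1024 partitions, preallocated up front.
        offsets = new long[maxPartitions];
        Arrays.fill(offsets, -1L);
    }

    // A single array read: no indirection, no boxing.
    public long get(int partition) {
        return offsets[partition];
    }

    // A single array write: zero allocation in the steady state.
    public void put(int partition, long offset) {
        offsets[partition] = offset;
    }

    public static void main(String[] args) {
        OffsetStore store = new OffsetStore(1024);
        store.put(0, 116L);
        System.out.println(store.get(0)); // 116
        System.out.println(store.get(1)); // -1
    }
}
```

Compared to the hash map, there is no entry object, no boxed key or value, and no hashing, so the per-access cache footprint is a single line.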
We talked about the main data structures. Our other data structures especially for our state machine implementation are all designed to be zero allocation in the steady state and very cache friendly. Even our hashed-wheel timer is made of primitive arrays with very few indirections.
Since we service a single topic per client, we can stamp out the client id and topic id bits and never change them.
Since this variable sized portion rarely changes, we can afford to create an index to it. So we have an index from a partition to its position within the fetch request ByteBuffer.
So let’s imagine that we sent this particular fetch request with partitions 0, 1 and 2. We have a response and the offsets have been advanced as shown by the offsets table. Now to create the next fetch request, we just read the new offsets and use the index to write them directly on the old buffer.
And we are ready to send this buffer. We avoided all the work required to create a buffer, write out the fixed size fields etc. It’s just writing a few integers to locations in memory.
This is how the code looks. There is a bit of noise in the code because we are iterating a bit set representing the partition assignment. But otherwise the code is simple - fetch the position within our request buffer for this partition. Get the next offset to fetch. Write this offset at the right position.
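The slide’s code isn’t reproduced here, but the loop it describes might look roughly like this sketch. The names are hypothetical: `partitionIndex[p]` stands for the byte position of partition p’s offset field inside the preserialized request buffer:

```java
import java.nio.ByteBuffer;
import java.util.BitSet;

// Sketch of patching the reusable fetch-request buffer in place:
// iterate the bit set of assigned partitions, look up where each
// partition's offset field lives, and overwrite just that long.
public class FetchRequestPatcher {
    public static void patchOffsets(ByteBuffer request, BitSet assignedPartitions,
                                    int[] partitionIndex, long[] nextOffsets) {
        for (int p = assignedPartitions.nextSetBit(0); p >= 0;
             p = assignedPartitions.nextSetBit(p + 1)) {
            // Absolute put: no position/limit bookkeeping, no allocation.
            request.putLong(partitionIndex[p], nextOffsets[p]);
        }
    }

    public static void main(String[] args) {
        ByteBuffer request = ByteBuffer.allocate(64); // stands in for a preserialized request
        BitSet assigned = new BitSet();
        assigned.set(0);
        assigned.set(2);
        int[] index = {8, 24, 40};          // fake byte positions of the offset fields
        long[] next = {117L, 0L, 201L};     // next offsets to fetch per partition
        patchOffsets(request, assigned, index, next);
        System.out.println(request.getLong(8));  // 117
        System.out.println(request.getLong(40)); // 201
    }
}
```

The rest of the request bytes are never rewritten; only the handful of offset fields change between sends.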
The metadata request is just frozen after consumer creation.
For the offset request for example we can store a pointer to the num partitions part of the request. So when we need to send a new offset request we can directly seek there and write out the partition bits.
We don’t use JSON, XML, Thrift, ProtocolBuffers etc for our messages. Our messages do not need to be deserialized before consumption. They can be consumed directly just like Kafka’s internal messages can be. There is no POJO created from a serialized message. Instead we can wrap the buffer in a flyweight and consume the fields of our messages by doing reads from the underlying buffer.
So we don’t need any copies or any allocation for steady state processing.
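The flyweight idea can be sketched like this. The field layout (an int id, a long timestamp, a double value) is made up for illustration, not our actual wire format:

```java
import java.nio.ByteBuffer;

// Flyweight over a shared buffer: fields are read straight out of
// the buffer at a given offset, and no POJO is ever materialized.
public class MessageFlyweight {
    private ByteBuffer buffer;
    private int offset;

    // Re-point the flyweight at a message; no allocation happens here.
    public MessageFlyweight wrap(ByteBuffer buffer, int offset) {
        this.buffer = buffer;
        this.offset = offset;
        return this;
    }

    // Each accessor is a direct read at a fixed offset into the buffer.
    public int id()         { return buffer.getInt(offset); }
    public long timestamp() { return buffer.getLong(offset + 4); }
    public double value()   { return buffer.getDouble(offset + 12); }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(64);
        buf.putInt(16, 42);       // id
        buf.putLong(20, 1234L);   // timestamp
        buf.putDouble(28, 3.5);   // value
        MessageFlyweight msg = new MessageFlyweight().wrap(buf, 16);
        System.out.println(msg.id());    // 42
        System.out.println(msg.value()); // 3.5
    }
}
```

A single flyweight instance can be reused for every message in a response, so steady-state consumption stays allocation free.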
The interface however is low level. Any handler of a message is fed with a buffer and a position and length within that buffer that represents the message. We could also alternatively set a position and limit on the buffer and send it to the application.
The poll call takes such a handler and feeds it with messages.
So let’s imagine that this is a response from Kafka. There are a bunch of fixed size bits on the top that we can skip. The real payload is a message set per partition.
We begin by going to the first message, ensuring there are no errors and then just passing the pointer and length to the handler. It synchronously consumes it making copies if necessary and then returns back to the parsing code.
We then consume the second message.
And the third and so on.
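Putting the handler interface and this parse loop together, a simplified sketch looks like the following. The wire layout here (`[count][len][bytes]...`) is deliberately made up and far simpler than Kafka’s real fetch response; it only shows the zero-copy dispatch pattern:

```java
import java.nio.ByteBuffer;

public class PollSketch {
    // Low-level handler: fed the shared buffer plus each message's
    // position and length. It must copy anything it wants to keep,
    // because the buffer is reused for the next response.
    interface MessageHandler {
        void onMessage(ByteBuffer buffer, int position, int length);
    }

    // Walk the response and hand each message to the handler
    // synchronously - no slicing, no per-message allocation.
    static void poll(ByteBuffer response, MessageHandler handler) {
        int pos = 0;
        int count = response.getInt(pos);
        pos += 4;
        for (int i = 0; i < count; i++) {
            int len = response.getInt(pos);
            pos += 4;
            handler.onMessage(response, pos, len);
            pos += len;
        }
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(32);
        buf.putInt(0, 2);                 // two messages
        buf.putInt(4, 3);                 // first message: length 3
        buf.put(8, (byte) 'a');
        buf.put(9, (byte) 'b');
        buf.put(10, (byte) 'c');
        buf.putInt(11, 1);                // second message: length 1
        buf.put(15, (byte) 'z');
        poll(buf, (b, p, l) -> {
            byte[] copy = new byte[l];    // copy only if the handler needs to keep it
            for (int j = 0; j < l; j++) copy[j] = b.get(p + j);
            System.out.println(new String(copy));
        });
    }
}
```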
Benefits are huge. Zero copy and zero allocation in the steady path. Since we are not creating a new ByteBuffer every time - DirectByteBuffers become viable. So we elide the copy involved in reading from the socket into HeapByteBuffers.
Sadly the applicability of this optimization is low. We are in control of our buffers and their lifetime so it is easy for us to avoid a copy. It is perhaps possible to create a very low level api that is not the default. I’ve not had much luck pushing this agenda in the past :)
The Kafka client allocates about 423 MB for 5000 * 300 = 1.5 million messages
That’s 86.56% of all allocations.
A sizeable portion of that is in fetch response parsing. A lot of that is ByteBuffer slicing, which our client does not do at all.
We talked about a possible but dangerous way to get rid of this entirely.
1.76% is in the selector.
About 9.27% is in cluster init. I am not sure why that’s so much.
We allocate 218 KB overall to process 5000 * 300 = 1.5 million messages.
The consumer does no allocations of its own. There are allocations done by the java NIO stack, but they don’t show up in the profile. Selectors allocate, and we plan to use an allocation-free Selector like the one the Netty project uses.
CPU used was 6.6%.
91% of that was the 0.9 consumer so about 6%
12% of that is spent on checksum math.
67% on handling fetch responses - we talked about a way to make this very fast.
Some 6% in metadata - not sure why
CPU was 2.63%.
The client uses about 50% of that so 1.31%.
16.67% of the total is spent in the select call, so the client code proper accounts for 33.33% of 2.63%, which is about 0.88% CPU.
Similar story for 10000 messages/second
Roughly a 4x improvement on CPU, and a lot more on allocations.