Tuning Kafka for Fun and Profit

This presentation was given at the ApacheCon 2015 Kafka Meetup.

These slides go into some detail on how to tune and scale Kafka clusters and the components involved. The slides themselves are bullet points, and all the detail is in the slide notes, so please download the original presentation and review those.

Tuning Kafka for Fun and Profit

  1. Tuning Kafka for Fun and Profit
  2. Zookeeper
     - 5-node vs. 3-node Ensembles
     - Solid State Disks
       - Use good SSDs
       - Transaction logs only
       - Significant improvement in latency and outstanding requests
  3. Kafka Broker Disks
     - Disk Layout
     - JBOD vs. RAID
       - JBOD and RAID-0 are similar
       - RAID-5/6 has significant performance overhead
       - RAID-10 still offers the best performance and protection
     - Filesystem
       - New testing shows XFS has a clear benefit
       - No tuning required
       - Will be continuing testing with more production traffic
  4. Scaling Kafka Clusters
     - Disk Capacity
     - Network Capacity
     - Partition Counts
       - Per-Cluster
       - Per-Broker
     - Limitations
       - Topic list length
  5. Topic Configuration
     - Retention Settings
     - Partition Counts
       - Balance over consumers
       - Balance over brokers
       - Partition size on disk
       - Application-specific requirements
  6. Mirror Maker
     - Network Locality
     - Consumer Tuning
       - Number of streams
       - Partition assignment strategy
     - Producer Tuning
       - Number of streams
       - In flight requests
       - Linger time

Editor's Notes

  • We start talking about tuning from the ground up, and Kafka is underpinned by Zookeeper. This tends to be an application that we forget about unless we have problems, because it just runs, but it needs love too.

    One thing we’ve learned recently is about ensemble sizing in Zookeeper. There has been a lot of work done on performance at different ensemble sizes, and this is largely driven by the ZAB protocol and the network traffic involved. We run either 3-node or 5-node ensembles, with most of the 3-node ensembles being in our staging environments, but we are moving to all 5-node for a very important reason. In order to add a new server to the ensemble, you need to take down each node in turn, add the new server to the config, and bring it back up. If you don’t want to take Zookeeper down, you have to maintain quorum while you do this. If you have one node down in a 3 node cluster due to hardware problems, there is no way to change the server list without an outage because you cannot take a second server offline and maintain quorum.

    The other important change we have made to Zookeeper is to run it on solid state disks. There’s some information out there that suggests this is a bad thing, but our experience has been the opposite. The first thing to note is that we use really good SSDs, not the consumer grade ones you can buy from Best Buy. The Virident cards we use have garbage collection and are very robust. We only put the transaction logs on SSD, keeping the snapshots on spinning disk. By doing this, we have dropped min, max, and average latency to 0ms (from an average of 20ms), with no outstanding requests during normal operations, even at peak load.
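
    As a concrete sketch, a zoo.cfg along these lines (hostnames and paths are placeholders, not our actual configuration) keeps the snapshots on spinning disk and moves only the transaction log onto the SSD, with a 5-node server list:

      # zoo.cfg -- hypothetical hosts and paths
      tickTime=2000
      initLimit=10
      syncLimit=5
      clientPort=2181
      # snapshots stay on spinning disk
      dataDir=/export/data/zookeeper
      # transaction logs are the only thing moved to the SSD
      dataLogDir=/export/ssd/zookeeper-txlog
      # 5-node ensemble: one node can be down and you can still roll the
      # remaining nodes one at a time without losing quorum
      server.1=zk1.example.com:2888:3888
      server.2=zk2.example.com:2888:3888
      server.3=zk3.example.com:2888:3888
      server.4=zk4.example.com:2888:3888
      server.5=zk5.example.com:2888:3888
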
  • Moving on from Zookeeper to the Kafka brokers, mostly what we look at here is disk. Our CPU and memory are fairly standard 12-CPU systems (with hyperthreading) and 64 GB of memory, and we do not colocate any other application with Kafka (which is running on physical hardware, not a virtual environment). Having a lot of memory is helpful because Kafka depends on the pagecache to get the best performance for consumers.

    With disk, the more spindles you have, the better off you will be. Produce times are dependent on disk IO (assuming you are not using an acknowledgement setting of 0, where you are producing in a “fire and forget” mode), so the more you can spread that out the better. We have recently done a lot of testing of RAID layouts to validate that our configuration of RAID-10 on 14 disks was the optimal layout. What we found is that JBOD and RAID-0 perform the best, but offer no protection of the data (if you lose one disk, you lose everything on that broker). RAID-5 and RAID-6 give you a nice balance of protection and disk capacity, but we ran into significant performance problems (produce times shot up to over 20 seconds at the 99th percentile). RAID-10 gave us the best balance of performance and protection, and is where we are staying for now. It is notable that we are running software RAID, and have not done any testing with hardware RAID. All of our testing was done with a variety of RAID stripe settings, and we found that, at least for RAID-10, the default 512 KB stripe is the best choice. Larger stripes did not offer a significant improvement.
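
    For reference, the sort of software RAID-10 array described here can be assembled with mdadm roughly like this (device names are placeholders):

      # 14-disk software RAID-10 with the default 512 KB chunk size
      mdadm --create /dev/md0 --level=10 --raid-devices=14 --chunk=512 /dev/sd[b-o]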

    We have also been retesting the filesystem lately. Currently, Kafka log segments are stored on an ext4 filesystem, configured with a 120 second commit interval with writeback mode. These settings are obviously unsafe, and we justified it by knowing that we were also replicating data within Kafka and could suffer a system failure. A datacenter power outage changed this view, and we were left with a large amount of disk corruption, both at the file level and the block level. We found that XFS is a better choice of filesystem, offering significant performance benefits without needing to resort to unsafe tuning. We’ll be continuing this testing in some of our staging environments soon.
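
    For illustration, the before and after look roughly like this in /etc/fstab (devices and mount points are placeholders):

      # old: ext4 with unsafe tuning (writeback journaling, 120 second commit)
      /dev/md0  /export/kafka  ext4  noatime,data=writeback,commit=120  0 0
      # new: XFS with default options, no special tuning required
      /dev/md0  /export/kafka  xfs   noatime                            0 0
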
  • Once we have an optimal configuration for a single broker, we look at how many brokers we need to have in a cluster. The driving factor for us right now is the disk capacity. We use a default retention of 4 days for almost all topics, and having enough disk space to handle this is the primary driver behind increasing the size of a cluster. We threshold our alerts at 60%, and increase the cluster size when we hit this limit. This gives us enough headroom to move partitions around (which resets the retention clock), and wait for new hardware to arrive if needed.
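
    As a back-of-the-envelope sketch of that sizing (the ingest rate, replication factor, and per-broker disk are made-up numbers; the 4-day retention and 60% threshold are from the notes above):

      import math

      ingest_mb_per_sec = 50        # assumed average inbound rate for the cluster
      replication_factor = 2        # assumed
      retention_days = 4            # our default retention
      per_broker_disk_tb = 9.6      # assumed usable disk per broker

      retained_tb = ingest_mb_per_sec * 86400 * retention_days * replication_factor / 1e6
      usable_tb_per_broker = per_broker_disk_tb * 0.60   # expand once disks pass 60% full
      brokers_needed = math.ceil(retained_tb / usable_tb_per_broker)
      print("~%.1f TB retained -> at least %d brokers" % (retained_tb, brokers_needed))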

    Another concern with sizing is the network capacity. While Kafka can definitely operate at line speed for a 1 Gigabit NIC, you want to have some overhead reserved for intra-cluster replication and communication. For this reason, we threshold our network alerts at 75%. If we go above that at peak load, we need to spread out the traffic over more systems. This is another good reason to make sure you balance partitions across your brokers as evenly as possible.

    The number of partitions you have in your cluster is a lesser, but important, concern. Here we are mostly concerned with the number of partitions on a single broker. We have noticed performance problems above 4000 partitions per broker, though we are not sure exactly where that problem is (whether it is with open filehandles, data structures in the broker, or problems in the controller). We are about to start testing on much larger Kafka broker hardware, however, and will be digging into this limitation.
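
    A rough way to keep an eye on this is to count partition replicas per broker from the output of kafka-topics.sh --describe (the exact output format varies by version, so treat this parsing as a sketch):

      # assumes: kafka-topics.sh --describe --zookeeper zk.example.com:2181 > describe.txt
      # and that each partition line contains a field like "Replicas: 1,2"
      from collections import Counter

      per_broker = Counter()
      with open("describe.txt") as f:
          for line in f:
              if "Replicas:" in line:
                  replicas = line.split("Replicas:")[1].split()[0]
                  for broker in replicas.split(","):
                      per_broker[broker] += 1

      for broker, count in sorted(per_broker.items(), key=lambda kv: -kv[1]):
          note = "  <-- near the ~4000 replicas where we saw problems" if count > 4000 else ""
          print("broker %s: %d partition replicas%s" % (broker, count, note))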

    As a side note, you should keep an eye on the number of topics you have for a reason that is not immediately obvious. Zookeeper has a limit of 1 MB as the size of the data in a node. This also applies to the combined length of all the names of the child nodes. Because all of the topics exist as child nodes under /brokers/topics, there is a limitation here. If your topic names are all 50 characters long, and you have more than about 20,900 topics, you will hit this limitation. This could cause Zookeeper to fail entirely, or it could cause problems in Kafka. The guarantee is that it will cause problems.
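
    The arithmetic behind that figure is simple enough to sanity-check (the 50-character average is just the example from above):

      # the children of /brokers/topics must fit in a single ~1 MB Zookeeper response,
      # so the combined length of all topic names is the constraint
      znode_limit_bytes = 1 * 1024 * 1024
      avg_topic_name_length = 50              # assumed average topic name length
      print(znode_limit_bytes // avg_topic_name_length)   # roughly the 20,900 topics cited above
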
  • Now that Kafka is running well, we can turn our attention to the topics. In general, there are two things to configure when it comes to topics: the retention, and the number of partitions. There are other things you can look at, such as the segment size, or how long until the segments are rolled, which may have application-specific concerns. But in large part, all we really care about is how long we keep the data, and how much we spread it out.

    Topics can be configured for retention by time, by size, or by key. There is a default broker-level setting for this, and it can be overridden per-topic. How you retain data is mostly application-dependent. We use a default retention of 4 days, and the reason for this is that in the normal state of affairs, consumers are caught up and reading from the end of the stream. We want enough retention so that if a problem happens with an individual application on the weekend, there is enough time to identify it, figure out what the problem is, resolve it, and catch back up before they fall off the end of their topic. We have certain types of data, such as some of the monitoring, which uses a shorter retention time because the data size is much larger and it gets fixed very quickly if there is ever a problem. We also have topics that are retained for much longer, up to a month, when there is a reason to because of how the application uses the data. The rule of thumb is to never hang on to more data than you really need. There are systems (such as HDFS) which are better designed for long-term storage of data.
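
    As an example of how this looks in practice (property and flag names are per the 0.8.x-era tooling, so check your version; topic names are placeholders):

      # broker-level default in server.properties: 4 days
      log.retention.hours=96

      # per-topic overrides
      kafka-topics.sh --zookeeper zk.example.com:2181 --alter \
          --topic metrics-high-volume --config retention.ms=86400000    # 1 day
      kafka-topics.sh --zookeeper zk.example.com:2181 --alter \
          --topic audit-trail --config retention.ms=2592000000          # 30 days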

    Partition counts are the tricky calculation. General guidance is to have fewer partitions, not more. This is because more partitions means more log segments, which is more file handles open, and more overhead in the brokers. At the same time, you need to make sure you have enough. There are several ways to look at this, all of which should be taken into account.

    Balancing over consumers – You must have at least as many partitions as you have consumers in the largest group for a topic. If a topic has 8 partitions, and you have 16 consumer instances, 8 of those consumers will be idle all the time.

    Balancing over brokers – If a topic’s partition count is not a multiple of the number of brokers in your cluster, the topic cannot be evenly balanced over the brokers. In a cluster with a large number of topics, this is less of a concern because over all the topics you should have a good balance regardless. In cases where you get a burst of messages (a high number of messages in a short period of time), balancing over the brokers is very important so you don’t swamp the network.

    Partition size on disk – This is one of our primary drivers in how we expand topics, as it is a good indication of how busy the topic is. We’ve picked a somewhat arbitrary threshold of 50 GB as the size of a single partition on disk on the brokers. Once a topic exceeds that, we increase the number of partitions (in general). This keeps the log segments of a reasonable size, which is good for recovering a crashed broker, and it also allows us to balance busy topics over more of the cluster.

    Through all of this, you also need to keep in mind application-specific requirements. You may have an application which is very concerned about message ordering, and only wants a single partition. You may have an application that is using keyed partitioning, and wants a high number of partitions so that they do not need to be expanded at any point (which would change the hashing of keys to partitions). This will often override other concerns. In a multi-tenant environment, the important thing is to have communication with the users, and a way of keeping track of these requirements so they are not forgotten.
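
    Pulling the partition-count considerations above together, a rough sizing sketch looks like this (every input except the 50 GB threshold is a made-up example, and application-specific requirements can still override the result):

      import math

      largest_consumer_group = 16      # assumed biggest consumer group for the topic
      broker_count = 12                # assumed
      topic_size_on_disk_gb = 1200     # assumed retained size of the topic
      max_partition_size_gb = 50       # the threshold discussed above

      needed = max(
          largest_consumer_group,                                     # keep every consumer busy
          math.ceil(topic_size_on_disk_gb / max_partition_size_gb),   # keep partitions under ~50 GB
      )
      # round up to a multiple of the broker count so the topic balances evenly
      partitions = int(math.ceil(needed / float(broker_count))) * broker_count
      print(partitions)   # 24 for these example numbers
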
  • In an environment with multiple Kafka clusters, you are often using the mirror maker application to replicate data between them. In addition, because mirror maker has both a consumer and a producer, it’s a useful case to look at when tuning both. If you want more information about using mirror maker for running Kafka clusters in tiers, I encourage you to look at one of my other presentations on multi-tier architectures that goes into more depth on the design and concerns around setting this up.

    With any consumer or producer, network locality is a big factor in performance. If your client is not in the same network as your Kafka cluster, you will have latency, bandwidth concerns, network partitions, and any number of other problems that you get when you have a lot of network hops in the way. With mirror maker, we need to choose whether we are going to locate it close to the cluster we are consuming from or the cluster we are producing to (as we use it most often for inter-datacenter replication). Our choice is always to locate it with the produce cluster. The reason for this is that if there is a problem with the produce side of the mirror maker, it will lose messages while the consumer continues to consume messages and commit offsets. If there is a problem with the consumer, it will just stop. So we choose to put the higher risk of network problems on the consume side, rather than the produce side.

    With tuning the mirror maker consumer, you will mostly consider how much data you need to consume, and the number of streams. You need to have enough copies of mirror maker in a given pipeline to handle the peak traffic, and mirror maker will not operate at line speed because it needs to decompress and recompress every message batch. This is also why you should run more than one consumer stream in a single mirror maker copy, to take advantage of parallelism to get around some of this inefficiency. You will also want to look at the partition assignment strategy that is used when balancing consumers. There is a strategy available for wildcard consumers called “roundrobin” which provides a much more even balance of partitions than the standard assignment strategy. There are also improvements in the most recent mirror maker code to the speed with which the consumer rebalance is performed.
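
    A sketch of what that consumer-side setup might look like (hostnames are placeholders, and property/flag names are per the 0.8.x-era mirror maker and high-level consumer, so check your version):

      # consumer.properties
      zookeeper.connect=zk-source.example.com:2181
      group.id=mirrormaker-aggregate
      # "roundrobin" balances partitions across wildcard consumers much more
      # evenly than the default range assignment
      partition.assignment.strategy=roundrobin

      # run with several consumer streams to parallelize the decompress/recompress work
      kafka-run-class.sh kafka.tools.MirrorMaker \
          --consumer.config consumer.properties \
          --producer.config producer.properties \
          --num.streams 8 --whitelist '.*'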

    On the producer side, you should also be running multiple streams. Where the consumer is responsible for decompressing message batches, the producer is responsible for compressing them again before sending to Kafka. You will also want to consider the number of in-flight requests that are allowed between the producer and the Kafka cluster. A higher number will allow for greater throughput, but it will also introduce a higher risk of loss: when the leadership changes on a partition in the produce cluster, message batches that are in flight will be lost. It is possible to mitigate this by changing the acknowledgement configuration on the producer, but that has other performance implications. Another parameter to look at is the linger time. The mirror maker producer will send a batch when it either reaches the byte size limit for a single batch or reaches the linger time. For busy topics, you will be subject to the size limit; for slow topics, you will be subject to the time limit. A higher linger time will allow the producer to assemble more efficient batches with better compression (and the Kafka broker itself does not decompress and break up batches, so this affects your disk utilization on the brokers). It will also increase the amount of time it takes for messages to get from one cluster to the next. You will need to determine how important these factors are and strike a balance.
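
    And a corresponding producer-side sketch (hostnames are placeholders; these settings assume the new Java producer, and the right values depend entirely on your latency and durability requirements):

      # producer.properties
      bootstrap.servers=kafka-aggregate.example.com:9092
      compression.type=gzip           # recompress batches before sending on
      acks=1                          # stronger acks are safer but slower
      # more in-flight requests means more throughput, but more batches at risk
      # when partition leadership moves in the produce cluster
      max.in.flight.requests.per.connection=2
      # a longer linger builds larger, better-compressed batches at the cost of latency
      linger.ms=100
      batch.size=262144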
