At LinkedIn, the Kafka infrastructure is run as a service: the Streaming team develops and deploys Kafka, but is not the producer or consumer of the data that flows through it. With multiple datacenters, and numerous applications sharing these clusters, we have developed an architecture with multiple pipelines and multiple tiers. Most days, this works out well, but it has led to many interesting problems. Over the years we have worked to develop a number of solutions, most of them open source, to make it possible for us to reliably handle over a trillion messages a day.
So who am I, and why am I qualified to stand up here?
I am a member of the Data Infrastructure Streaming SRE team at LinkedIn. We’re responsible for Kafka and Zookeeper operations, as well as Samza and a couple of iterations of our change capture systems.
SRE stands for Site Reliability Engineering. Many of you, like myself before I started in this role, may not be familiar with the title. SRE combines several roles that fit together into one Operations position.
Foremost, we are administrators. We manage all of the systems in our area.
We are also architects. We do capacity planning for our deployments, plan out our infrastructure in new datacenters, and make sure all the pieces fit together.
And we are also developers. We identify tools we need, both to make our jobs easier and to keep our users happy, and we write and maintain them.
At the end of the day, our job is to keep the site running, always.
What are the things we are going to cover in this talk? I’m going to assume some basic knowledge of what Kafka is and how it works, so I won’t be covering the basics. I’ll start by describing the Kafka pipelines we have set up at LinkedIn in our multi-tenant environment. This will transition into the tier architecture that many of those pipelines use. But I’ll spend most of our time on the interesting problems that we’ve run into in running Kafka at such a large scale. We’ll wrap up talking about a couple of the things that we’re working on now, and hopefully have some time for Q&A.
I won’t be going into too much detail on how Kafka works. If you do not have a basic understanding of Kafka itself, I suggest checking out some of the resources listed in the Reference slides at the end of this deck.
Here’s what a single Kafka cluster looks like at LinkedIn. I’ll get into some details on the TrackerProducer/TrackerConsumer components later, but they are internal libraries that wrap the open source Kafka producer and consumer components and integrate with our schema registry and our monitoring systems.
Every cluster has multiple Kafka brokers, storing their metadata in a Zookeeper ensemble. We have producers sending messages in, and consumers reading messages out. At the present time, our consumers talk to Zookeeper as well and everything works well. In LinkedIn’s environment, all of these components live in the same datacenter, in the same network.
What happens when you have two sites to deal with?
Now we iterate on the architecture. We add the concept of an aggregate Kafka cluster, which contains all of the messages from each of the primary datacenter local clusters. We also have a copy of this cluster in the secondary datacenter, C, for consumers there to access. We still have cross-datacenter traffic – that can’t be avoided if we need to move data around. But we have isolated it to one application, mirror maker, which we can monitor and assure works properly. This is a better situation than needing to have each consumer worry about it for themselves.
We’ve definitely added complexity here, but it serves a purpose. By having the infrastructure be a little more complex, we simplify the usage of Kafka for our customers. Producers know that if they send messages to their local cluster, it will show up in the appropriate places without additional work on their part. Consumers can select which view of the data they need, and have assurances that they will see everything that is produced. The intricacies of how the data gets moved around are left to people like me, who run the Kafka infrastructure itself.
We’ve chosen to keep all of our clients local to the clusters and use a tiered architecture due to several major concerns.
The primary concern is around the networking itself. Kafka enables multiple consumers to read the same topic, which means if we are reading remotely, we are copying messages over expensive inter-datacenter connections multiple times. We also have to handle problems like network partitioning in every client. Granted, you can have a partition even within a single datacenter, but it happens much more frequently when you are dealing with large distances. There’s also the concern of latency in connections – distance increases latency. Latency can cause interesting problems in client applications, and I like life to be boring.
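To put rough numbers on the "copying messages multiple times" point, here is a minimal sketch of the arithmetic. The function name and the figures are made up for illustration; the only claim taken from the talk is that remote consumers each pull their own copy across the link, while a mirror maker pulls one copy that local consumers then share.

```python
def cross_dc_bytes(topic_bytes_per_day: int, remote_consumer_groups: int, mirrored: bool) -> int:
    """Bytes crossing the inter-datacenter link per day for one topic.

    Hypothetical model: without mirroring, every remote consumer group
    reads its own copy over the link; with mirroring, the link carries
    one copy and consumers read it from their local aggregate cluster.
    """
    copies = 1 if mirrored else remote_consumer_groups
    return topic_bytes_per_day * copies


# Illustrative numbers only: a 100 GB/day topic read by 5 remote groups.
direct = cross_dc_bytes(100, 5, mirrored=False)
tiered = cross_dc_bytes(100, 5, mirrored=True)
print(direct, tiered)
```

The gap grows linearly with the number of consumer groups, which is why isolating the cross-datacenter hop to one application pays off quickly.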
There are also security concerns around talking across datacenters. If we keep all of our clients local, we do not have to worry about ACL problems between the clients and the brokers (and Zookeeper as well). We can also deal with the problem of encrypting data in transit much more easily. This is one problem we have not worried about as much, but it is becoming a big concern now.
The last concern is over resource usage. Everything at LinkedIn talks to Kafka, and a problem that takes out a production cluster is a major event. It could mean we have to shift traffic out of the datacenter until we resolve it, or it could result in inconsistent behavior in applications. Any application could overwhelm a cluster, but there are some, such as applications that run in Hadoop, that are more prone to this. By keeping those clients talking to a cluster that is separate from the front end, we mitigate resource contention.
More components means that we have more places to poke and prod to get the most efficiency out of our system. With multiple tiers, most of this revolves around making sure the sizes of everything are correct.
This is as good a time as any for a little self-promotion. Many of the questions around how to set up and lay out Kafka clusters, including specific performance concerns and tuning, are covered in this fine book that I am co-authoring. You’ll also find a trove of information about client development, stream processing, and a variety of use cases for Kafka.
We currently have 4 chapters complete, and it’s available from O’Reilly under their early access program. We expect to have the book completed late this year, or early next, with chapters being released as soon as we can write them.
Many of us use Kafka for monitoring applications. At LinkedIn, every application and server writes metrics and logs to Kafka. We have central applications that read out these metrics and provide pretty graphs, thresholding, and other tools to make sure that everything is running properly within LinkedIn. Kafka itself is no exception, which leads to this…
As soon as I say “monitoring Kafka with Kafka”, we know this is not a good thing.
For the broker, what are the critical metrics that I’m keeping an eye on every day?
Bytes in, bytes out, and messages in are all critical metrics for us from a growth point of view. While we don’t alert on these, we do keep an eye on them because they help us to understand how the usage of the cluster is growing over time, and they let us plan for the next expansion. You may ask why I don’t have messages out on this list. It’s because there is no messages out metric. Kafka consumers read batches of messages, not single messages, and it’s not easy for Kafka to count messages on the outbound side. There’s a metric on the number of fetches, but it’s less interesting to me.
For partitions, we start with the number of partitions per broker, and the number of leader partitions per broker. As we know, there is a single broker responsible for leadership for a given partition. In a healthy cluster, I want to make sure that each broker has approximately the same number of partitions, and that each broker is leading about 50% of those because we have a replication factor of 2 for most things. We can also see this reflected in the byte rates, because if the partitions are imbalanced, the byte rates will be as well. This gives us uneven load and that can cause a lot of problems.
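The balance check described here can be sketched in a few lines. This is an illustration, not a tool we ship: the input format (a dict from partition name to a list of broker ids) is an assumption, and so is treating the first broker in each list as the current leader.

```python
from collections import Counter


def broker_balance(assignments):
    """Summarize replica and leader counts per broker.

    assignments: dict mapping a partition name to its list of broker ids.
    Assumption for this sketch: the first id in each list is the leader.
    Returns {broker: (replica_count, leader_count, leader_ratio)}.
    """
    replicas = Counter()
    leaders = Counter()
    for brokers in assignments.values():
        leaders[brokers[0]] += 1
        replicas.update(brokers)
    return {b: (replicas[b], leaders[b], leaders[b] / replicas[b]) for b in replicas}


# Made-up layout with replication factor 2 across three brokers.
layout = {"t-0": [1, 2], "t-1": [2, 1], "t-2": [1, 3], "t-3": [3, 2]}
for broker, (n_replicas, n_leaders, ratio) in sorted(broker_balance(layout).items()):
    print(broker, n_replicas, n_leaders, round(ratio, 2))
```

With a replication factor of 2, a leader ratio that drifts far from 0.5 on any broker, or replica counts that differ widely between brokers, is the imbalance signal.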
More importantly though, we monitor the number of under replicated partitions that each broker is reporting. I’m going to get into this in much more detail in a few slides, but this indicates the number of partitions that the broker is leader for where at least one of the replicas has fallen behind. This is the single most important metric to monitor and alert on. It indicates a number of problems and a single alert here will provide coverage of most Kafka issues.
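As a rough illustration of what that count means, here is a sketch that derives it from partition metadata. The dict keys ('leader', 'replicas', 'isr') are a hypothetical input shape chosen for this example; in practice you would read the metric the broker already reports rather than compute it yourself.

```python
def under_replicated_by_leader(partitions):
    """Count under replicated partitions per leader broker.

    partitions: iterable of dicts with 'leader' (broker id), 'replicas'
    (assigned broker ids) and 'isr' (in-sync broker ids). A partition is
    counted when its in-sync set is smaller than its assigned replica set.
    """
    counts = {}
    for p in partitions:
        if len(p["isr"]) < len(p["replicas"]):
            counts[p["leader"]] = counts.get(p["leader"], 0) + 1
    return counts


# Made-up metadata: broker 2 has fallen out of sync for one partition led by broker 1.
sample = [
    {"leader": 1, "replicas": [1, 2], "isr": [1]},
    {"leader": 2, "replicas": [2, 3], "isr": [2, 3]},
    {"leader": 1, "replicas": [1, 3], "isr": [1, 3]},
]
print(under_replicated_by_leader(sample))
```

Anything other than an empty result is worth an alert.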
Lastly, there are metrics on the thread pool usage, both network and request pools, as well as rate and time metrics on the different types of requests. These are all examples of metrics that are good to have, but they’re difficult to alert on. If you are able to establish a good baseline on some of the request time metrics, I do recommend doing it, however, as rising request times can indicate a problem that is building up, and you may be able to see it before it becomes under replicated partitions.
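One simple way to turn a baseline into an alert is to flag a request time that sits well above its recent history. The sketch below uses mean plus a multiple of the standard deviation; that rule, the function name, and the default multiplier are assumptions for illustration, not how our alerting actually works.

```python
from statistics import mean, stdev


def exceeds_baseline(history, current, k=3.0):
    """Return True if `current` is more than k standard deviations above
    the mean of `history` (recent request-time samples, same units)."""
    if len(history) < 2:
        return False  # not enough samples to estimate a spread
    return current > mean(history) + k * stdev(history)


# Made-up request times in milliseconds.
recent = [10, 11, 9, 10, 10]
print(exceeds_baseline(recent, 15))
print(exceeds_baseline(recent, 11))
```

A rule like this is noisy on metrics without a stable baseline, which is why these are harder to alert on than under replicated partitions.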
Buried in the middle there is the “max dirty percent” metric. This is a measurement of how many log segments are able to be compacted that are not currently compacted. Right now, this is the only way to monitor the health of log compaction within Kafka, which is critical for the consumer offsets topic at the very least. If the thread doing log compaction dies (which it can do frequently), the only way you will know is by this metric increasing and staying high. Normal behavior is for the metric to spike up and immediately drop back down again.
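The difference between a healthy spike and a dead compaction thread is whether the value stays high, so a check has to look at a run of samples rather than a single one. A minimal sketch, where the threshold and the number of consecutive samples are placeholder values you would tune for your own clusters:

```python
def compaction_looks_stuck(samples, threshold=50.0, min_consecutive=5):
    """Return True if the most recent `min_consecutive` samples of the
    max dirty percent metric are all above `threshold`.

    A single spike that drops back down does not trigger; a value that
    rises and stays high does. Both defaults are hypothetical.
    """
    if len(samples) < min_consecutive:
        return False
    return all(s > threshold for s in samples[-min_consecutive:])


print(compaction_looks_stuck([10, 90, 5, 8, 7]))         # spike, then recovery
print(compaction_looks_stuck([10, 60, 70, 80, 90, 95]))  # rises and stays high
```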
There are a number of things that can be improved upon, both in the brokers and in the mirror maker, to make it easier to set up and manage multiple datacenters.
Another big problem is that we are using RAID and providing a single mount point to the Kafka brokers for a log dir. This is because there are some issues with the way JBOD is handled in the broker. Specifically, the brokers assign partitions to log dirs by round robin, not taking into account current size. In addition, there are no administrative functions to move partitions from one directory to another. And if a single disk fails, the entire broker fails. If JBOD were more robust, we could have replication factors of 3 or 4 without an increase in hardware cost, which would allow us to have “no data loss” configurations.
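To show why ignoring size matters, here is a toy simulation comparing round robin placement with a size-aware alternative. Neither function is broker code: the round robin version models the behavior as described above, and the least-used version is a hypothetical policy included only for contrast.

```python
def assign_round_robin(partition_sizes, num_dirs):
    """Place partitions on log dirs in rotation, ignoring size.
    Returns the total size that ends up on each dir."""
    usage = [0] * num_dirs
    for i, size in enumerate(partition_sizes):
        usage[i % num_dirs] += size
    return usage


def assign_least_used(partition_sizes, num_dirs):
    """Hypothetical alternative: place each partition on whichever dir
    currently holds the least data (first dir wins ties)."""
    usage = [0] * num_dirs
    for size in partition_sizes:
        target = usage.index(min(usage))
        usage[target] += size
    return usage


# Made-up sizes: large and tiny partitions created alternately, two disks.
sizes = [100, 1, 100, 1, 100, 1]
print(assign_round_robin(sizes, 2))
print(assign_least_used(sizes, 2))
```

With alternating large and small partitions, round robin piles every large one onto the same disk, and with no way to move partitions between directories afterwards, that skew is permanent.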
The big improvement to mirror maker is the creation of an identity mirror maker, which would keep message batches together in the exact same partition from source to target cluster. This would completely eliminate the compression overhead from the mirror maker, making it much faster and more efficient. Of course, this requires maintaining the partitions counts in the clusters properly, and allowing the mirror maker to increase partition counts in a target cluster if needed.
That leads into the idea of multi-cluster management. While there are a couple of people making some headway on this in the open source world, we still lack a solid interface for managing Kafka clusters as part of an overall infrastructure. This would include maintaining topic configurations across multiple clusters and easily configuring and visualizing the mirror maker links between them.
Another piece needed is better client monitoring overall. Burrow provides us with a good view of what the consumers are doing, but there’s nothing available yet for producer client monitoring. We, of course, have our internal audit system for this. And other companies have their own versions as well. It would be nice to have an open source solution that anyone can use for assuring that the producers are working properly.
We could also use better end-to-end monitoring of our Kafka clusters, so we can know that they are available. We have a lot of metrics that can track information about the individual components, but without a client view of the cluster, we don’t know if the cluster is actually available. We also have a hard time making sure that the entire pipeline is working properly. There’s not a lot available for this right now, but watch this space…
So how can you get more involved in the Kafka community?
The most obvious answer is to go to kafka.apache.org. From there you can join the mailing lists, either on the development or the user side.
You’ll find people on the #apache-kafka channel on Freenode IRC if you have questions.
We also coordinate meetups for both Kafka and Samza in the Bay Area, with streaming if you are not local.
You can also dive into the source repository, and work on and contribute your own tools back.
Kafka may be young, but it’s a critical piece of data infrastructure for many of us.