This is a talk given at ApacheCon 2015
If data is the lifeblood of high technology, Apache Kafka is the circulatory system in use at LinkedIn. It is used for moving every type of data around between systems, and it touches virtually every server, every day. This can only be accomplished with multiple Kafka clusters, installed at several sites, and they must all work together to assure no message loss, and almost no message duplication. In this presentation, we will discuss the architectural choices behind how the clusters are deployed, and the tools and processes that have been developed to manage them. Todd Palino will also discuss some of the challenges of running Kafka at this scale, and how they are being addressed both operationally and in the Kafka development community.
Note - each slide has a significant amount of slide notes that go into detail. Please make sure to check out the downloaded file to get the full content!
So who am I, and why am I qualified to stand up here?
I am one-fourth of the Data Infrastructure Streaming SRE team at LinkedIn. We’re responsible for Kafka, Samza, and Zookeeper operations
SRE stands for Site Reliability Engineering. Many of you, like myself before I started in this role, may not be familiar with the title. SRE combines several roles that fit together into one Operations position
Foremost, we are administrators. We manage all of the systems in our area
We are also architects. We do capacity planning for our deployments, plan out our infrastructure in new datacenters, and make sure all the pieces fit together
And we are also developers. We identify tools we need, both to make our jobs easier and to keep our users happy, and we write and maintain them.
At the end of the day, our job is to keep the site running, always.
What are the things we are going to cover in this talk? We’ll start by talking about the basics of how Kafka works, very briefly, and move right into what a tiered architecture looks like along with the infrastructure tool we use for creating our tiers – mirror maker. I will cover performance tuning, specifically when it comes to laying out and managing tiered clusters. I’ll also talk about monitoring and other ways we assure our data gets where it is going, intact. Lastly, we’ll talk about what work is going on right now that will continue to improve the ecosystem for running large Kafka installations, and what you can do to get involved.
These are the numbers I presented this time last year, as far as how much data we push around in Kafka at LinkedIn. Over the last year, it’s changed significantly
We now have well over 1100 brokers in total in our 50+ clusters
Which are managing over 31,000 topics
With over 350 thousand partitions between them, not including replication
We’ve gone from 220 billion messages a day to over 875 billion, and that was a slow day
There is now over 185 terabytes per day flowing into Kafka, an increase of almost 4 times
And consumers are reading over 675 terabytes per day out. Of course, those are both compressed data numbers
At peak, we’re receiving over 10 and a half million messages per second
For a total of 18 and a half gigabits per second of inbound traffic
And the consumers are reading over 70 gigabits per second at the same time
Again, this is compressed data. This is a fairly astonishing growth rate for the amount of data we are moving around with Kafka. Some of it comes from standing up new datacenters, so let’s move directly into what that looks like.
Let’s move right into what Kafka clusters look like, and what happens when we start organizing them into tiers.
I won’t be going into too much detail on how Kafka works. If you do not have a basic understanding of Kafka itself, I suggest checking out some of the resources listed in the Reference slides at the end of this deck.
Here’s what a single Kafka cluster looks like at LinkedIn. I’ll get into some details on the TrackerProducer/TrackerConsumer components later, but they are internal libraries that wrap the open source Kafka producer and consumer components and integrate with our schema registry and our monitoring systems.
Every cluster has multiple Kafka brokers, storing their metadata in a Zookeeper ensemble. We have producers sending messages in, and consumers reading messages out. At the present time, our consumers talk to Zookeeper as well and everything works well. In LinkedIn’s environment, all of these components live in the same datacenter, in the same network.
What happens when you have two sites to deal with?
Multiple datacenters is where this starts to get interesting. Here is an example of a layout that uses one Kafka cluster. We’re keeping the cluster in a single datacenter, because having it span datacenters is an entirely different level of complexity. In addition, Kafka has no provision for reading from the follower brokers, so you would still be crossing datacenters with your clients.
The problem with this layout should be quite obvious – if we lose Datacenter A, we’ve lost everything. Not only do we have concerns with network partitions between the datacenters cutting off access for one consumer or producer or another, we have no redundancy at all.
So to improve this situation, we’ll run a Kafka cluster in each of our primary datacenters, A and B. In this layout, C is a lower-tier datacenter where we don’t have producers of the data, only consumers. Consider it a backend environment where you run things like Hadoop.
Our producers all talk to the local Kafka cluster. Consumers in the primary datacenters talk to their local cluster as well. Now if we lose either datacenter A or B, the other datacenter can continue to operate. We’ve pushed more complexity on the consumers that need to access all of the data from both datacenters, however. They have to maintain consumer connections to both clusters, and they will have to deal with networking problems that come up as a result. In addition, latency in the network connection can manifest in strange ways in an application.
Now we iterate on the architecture one more time. We add the concept of an aggregate Kafka cluster, which contains all of the messages from each of the primary datacenter local clusters. We also have a copy of this cluster in the secondary datacenter, C, for consumers there to access. We still have cross-datacenter traffic – that can’t be avoided if we need to move data around. But we have isolated it to one application, mirror maker, which we can monitor and assure works properly. This is a better situation than needing to have each consumer worry about it for themselves.
We’ve definitely added complexity here, but it serves a purpose. By having the infrastructure be a little more complex, we simplify the usage of Kafka for our customers. Producers know that if they send messages to their local cluster, it will show up in the appropriate places without additional work on their part. Consumers can select which view of the data they need, and have assurances that they will see everything that is produced. The intricacies of how the data gets moved around are left to people like me, who run the Kafka infrastructure itself.
We’ve chosen to keep all of our clients local to the clusters and use a tiered architecture due to several major concerns.
The primary concern is around the networking itself. Kafka enables multiple consumers to read the same topic, which means if we are reading remotely, we are copying messages over expensive inter-datacenter connections multiple times. We also have to handle problems like network partitioning in every client. Granted, you can have a partition even within a single datacenter, but it happens much more frequently when you are dealing with large distances. There’s also the concern of latency in connections – distance increases latency. Latency can cause interesting problems in client applications, and I like life to be boring.
There are also security concerns around talking across datacenters. If we keep all of our clients local, we do not have to worry about ACL problems between the clients and the brokers (and Zookeeper as well). We can also deal with the problem of encrypting data in transit much more easily. This is one problem we have not worried about as much, but it is becoming a big concern now.
The last concern is over resource usage. Everything at LinkedIn talks to Kafka, and a problem that takes out a production cluster is a major event. It could mean we have to shift traffic out of the datacenter until we resolve it, or it could result in inconsistent behavior in applications. Any application could overwhelm a cluster, but there are some, such as applications that run in Hadoop, that are more prone to this. By keeping those clients talking to a cluster that is separate from the front end, we mitigate resource contention.
So how do we move all of these messages around? That falls to the Kafka mirror maker application, which is part of the open source project.
Mirror Maker’s sole job is to copy messages from one cluster to another, and it is the glue that ties a multi-tier architecture together. It consumes messages from one cluster, and puts them on an internal queue. It then pops messages off the queue and produces them to the target cluster.
Because of this architecture, there is no direct communication back from the producer to the consumer component. The only feedback is when the queue fills up, which will cause all of the consumer streams to block. The only way this happens, however, is if the producer stops reading from the queue. If the producer is retrying due to a target cluster problem, it will eventually drop messages.
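The consume, queue, and produce flow can be sketched with a bounded queue. This is a toy model in Python, not mirror maker's actual code: the names `mirror`, `source`, and `destination` are made up for illustration, and a plain list stands in for the target cluster.

```python
import queue
import threading

def mirror(source, destination, queue_size=2):
    """Copy every message from `source` (an iterable) into `destination`
    (a list) through a bounded internal queue."""
    internal = queue.Queue(maxsize=queue_size)
    done = object()  # sentinel marking the end of the source stream

    def consume():
        for message in source:
            # put() blocks while the queue is full. This is the only
            # feedback the consumer side ever gets from the producer side.
            internal.put(message)
        internal.put(done)

    def produce():
        while True:
            message = internal.get()
            if message is done:
                return
            destination.append(message)  # stands in for a send to the target

    workers = [threading.Thread(target=consume), threading.Thread(target=produce)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return destination

copied = mirror(range(10), [])
```

With a small `queue_size`, the consumer thread spends most of its time blocked in `put()` whenever the producer side is slow, which is the stall described above.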
This is why it’s a best practice to place the mirror maker in the same network as the target cluster. If a mirror maker fails to consume messages, it will just not commit offsets and you will not lose messages. You will just slow down or stop consuming. If the mirror maker fails to produce messages, then more likely than not it will start dropping messages. This means we want to keep network problems on the consumer side as much as possible.
Another thing to consider is that Kafka has no way to prevent loops. If you have two mirror makers, one that copies messages from cluster A to cluster B, and another that copies in the reverse direction, they will duplicate messages if configured with the same topics. It’s a great way to fill up your disk very quickly.
There are some rules that should be followed when setting up tiers of Kafka clusters. The first is that you always produce to local clusters, never to the aggregate clusters themselves.
This is important. NEVER do this! If you produce to an aggregate tier, your aggregate tiers will be out of sync with each other. Part of the idea behind an aggregate tier is that it contains everything. If you change this assumption, you are doing something different.
Don’t do it.
Not even once.
Past that first rule, keep in mind that not everything needs to be in the aggregate cluster. The first part of this is that not everything can be mirrored. Log compacted topics, in particular, do not play nice with mirror maker. Queuing topics which are internal to applications most often do not need to be aggregated either.
But when you do aggregate, make sure you are consistent. If a topic shows up in your aggregate tier, it needs to be mirrored from all of the source clusters, not just a subset. If you do not do this, you break a promise to your customers that aggregate contains the entirety of what was produced to local.
Lastly, consider very carefully if you actually want to have aggregate clusters in your front-end datacenters. We have them, but I often wish we didn’t. If you have the aggregate view of data in your production datacenter, you can inadvertently encourage the creation of single-master services. That is, an application which runs in only one datacenter. This happens with Kafka because there is no way to equate offsets between two different clusters, even if those clusters have the same topics and message content. Applications like this are troublesome when you want to shut down a datacenter, due to problems or maintenance.
That said, there are use cases where it is hard to avoid using the aggregate view. Search services, which need to modify their indices based on what happens everywhere, are an example. But if you can move this type of processing to a second-tier network, and then copy the results back out to a front-end application, it can be worth the additional complexity.
One of the first issues with mirror maker is that as you add new local clusters, you geometrically increase the number of paths you have to mirror messages over. If each path is a mirror maker instance, this gets out of control very quickly. Thankfully, this has a simple solution – you can configure a mirror maker with multiple consumers. We actually moved away from this design early on, but are now looking at moving back since we have more sites to manage and the mirror maker application has matured quite a bit.
Another big problem is that the mirror maker producer can lose messages. If it is sending messages to a cluster and there is a leadership change, it will lose batches that are in flight. This presents a big problem for clients who want to assure the delivery of their messages. They can set acks=-1 in their application to be safer, but all bets are off when there is a mirror maker involved to move the messages to an aggregate cluster. The solution here is to keep fewer batches in flight at any given time, and to use acks=-1 on the mirror maker producer as well. We are currently testing this configuration change so that we can provide a better end-to-end service.
Mirror maker is forced to decompress every message batch upon consuming it, and then recompress it on sending it to the target cluster. The reason for this is that the mirror maker has no idea, looking at the compressed batch, if that batch contains keyed messages. If it does, the mirror maker needs to honor that and send the messages to specific partitions. As we use few keyed topics, this is a huge waste of resources. A possible solution is to flag the compressed batch as to whether or not it contains keyed messages, and only decompress it if it does. This doesn’t have a solution right now, but there’s ongoing work to address it.
In addition to this, mirror maker cannot preserve the partition of a message. Within a single Kafka cluster, you can be assured that the order of messages in a partition is the order in which they were produced. You also know that for a keyed topic, a message with a specific key will always show up in the same partition. Mirror maker essentially shuffles all of the partitions again: unkeyed messages will end up mingled with messages from other partitions, and keyed messages will still all go to the same partition, but it will not necessarily be the partition they were in in the source cluster. One way to resolve this would be an “identity” mirror maker, which preserves partitioning. This isn’t a simple solution, however, because you need to take into account the case where you are coming from two local clusters with 8 partitions each into an aggregate cluster with 16 partitions. Does it fail? Interleave? Offset? The desired behavior will depend largely on the user.
More components means that we have more places to poke and prod to get the most efficiency out of our system. With multiple tiers most of this revolves around making sure the sizes of everything are correct.
It may not strictly be tuning, but having your Kafka clusters be the right size is the first place you need to start. It can be a little difficult to determine exactly how large your cluster should be, but there are a few major points you have to consider.
First, how much disk space do you have on each broker, and how much space do you need to maintain message retention? One of our rules is that we keep disk usage for the log segments partition under 60%. This allows us enough headroom to move partitions around when needed (especially because retention for the partition resets when you move it). The next concern is how much network you have. If you have gigabit network interfaces, and your Kafka cluster is going to receive 5 gigabits per second of traffic at peak, you need at least 5 brokers just to take that traffic in. CPU, memory, and disk I/O all take a back seat to these concerns, because they mostly drive how fast your cluster operates, not if it operates at all.
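As a rough illustration of the disk and network constraints, here is a minimal Python sketch. The function name and every input value are assumptions made up for the example; only the 60% ceiling and the gigabit versus 5 gigabit case come from the talk. It ignores replication traffic and consumer reads on the network side, so a real cluster would need more.

```python
import math

def brokers_needed(daily_inbound_gb, retention_days, replication_factor,
                   disk_per_broker_gb, peak_inbound_gbps, nic_gbps,
                   max_disk_fraction=0.60):
    """Lower bound on broker count from disk and network alone."""
    # Disk: all retained data, including replicas, must fit under the ceiling.
    retained_gb = daily_inbound_gb * retention_days * replication_factor
    by_disk = math.ceil(retained_gb / (disk_per_broker_gb * max_disk_fraction))
    # Network: peak inbound traffic spread across one interface per broker.
    by_network = math.ceil(peak_inbound_gbps / nic_gbps)
    return max(by_disk, by_network)

# Network-bound: gigabit interfaces, 5 gigabits per second at peak.
network_bound = brokers_needed(1000, 4, 2, 4000, 5, 1)
# Disk-bound: the same data on much smaller disks.
disk_bound = brokers_needed(1000, 4, 2, 1000, 1, 1)
```

Whichever constraint gives the larger number wins, which is why CPU and memory do not appear in the calculation at all.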
Size your local clusters first, then you can consider how large your aggregate cluster needs to be. For the most part, taking the number of local clusters you have, multiplied by the number of brokers in each local cluster, will give you the size of your aggregate cluster. If you have 3 local clusters with 10 brokers each, your aggregate cluster should be at least 30 brokers. You’ll also need to take into account the number of consumers of the aggregate messages. This can affect how much network bandwidth you need, which will change the number of brokers that you need.
You will also need to size your topics appropriately. This tends to be a topic of much discussion, because there are so many variables that get considered here. For example:
How many brokers do you have? Do you want to perfectly balance the topic across the brokers? If so, you should have the number of partitions be a multiple of the number of brokers
How many consumers does your topic have in its largest consumer group? If you have 8 partitions and 16 consumers, 8 of those consumers will be sitting idle
Does your application have specific requirements around partition counts? If you are using keyed messages, you may want to go with a larger number of partitions to start with so you don’t have to expand it later based on other criteria
Another concern we have is around keeping the size of a partition on disk manageable. Very large partitions can be harder to keep balanced in a cluster to make sure each broker is doing its fair share of work. We use a guideline internally of making sure partitions do not exceed 50 gigabytes on disk. When they get close to or exceed that, we expand the topic (provided it is not a keyed topic).
Once again here, the partition counts in your aggregate cluster should be a simple calculation. For most topics, we take the number of partitions in the local cluster, multiply it by the number of local clusters, and that is the number of partitions in the aggregate cluster. You also want to check your partition counts regularly, especially if you use automatic topic creation. An imbalance in the number of partitions between clusters can bog down mirror maker very quickly.
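The partition rules of thumb above reduce to a few lines of arithmetic. This is a minimal Python sketch; the function names are hypothetical, and the size calculation assumes messages are spread evenly across partitions.

```python
import math

def idle_consumers(partitions, consumers_in_group):
    # Each partition is read by at most one consumer in a group, so any
    # consumers beyond the partition count have nothing to read.
    return max(0, consumers_in_group - partitions)

def partitions_for_size(topic_size_gb, max_partition_gb=50):
    # Smallest partition count that keeps each partition under the guideline.
    return max(1, math.ceil(topic_size_gb / max_partition_gb))

def aggregate_partitions(local_partitions, local_clusters):
    # Aggregate count for most topics: local count times number of locals.
    return local_partitions * local_clusters

idle = idle_consumers(partitions=8, consumers_in_group=16)
needed = partitions_for_size(topic_size_gb=420)
aggregate = aggregate_partitions(local_partitions=8, local_clusters=2)
```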
Another thing to consider is how much retention you need. There’s nothing that says that retention has to be the same in the local and aggregate tiers. You may want to retain messages longer in aggregate, keeping them in the local tier only long enough to get them out. Just remember that this can change your sizing calculation for the aggregate clusters. If you have twice the retention, you may very well need twice the number of brokers.
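Putting the aggregate sizing rule together with retention gives a quick estimate. This is a sketch in Python with a hypothetical function name; it treats broker count as scaling linearly with retention and ignores the extra network load from aggregate consumers, so treat the result as a starting point.

```python
import math

def aggregate_brokers(local_cluster_sizes, retention_ratio=1.0):
    """Baseline is the sum of the local broker counts, scaled by how much
    longer the aggregate tier retains data than the local tier."""
    baseline = sum(local_cluster_sizes)
    return math.ceil(baseline * retention_ratio)

same_retention = aggregate_brokers([10, 10, 10])
double_retention = aggregate_brokers([10, 10, 10], retention_ratio=2.0)
```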
We introduced another component with Mirror Maker, so we also need to size that appropriately. When we talk about sizing, we are talking about the number of copies of mirror maker with the same consumer and producer configuration, which is one pipeline.
With mirror maker, it’s mostly about network throughput. Because of the decompression and recompression of message batches, you’re probably never going to run at wire speed. This means that you can easily co-locate multiple mirror makers on one set of servers to efficiently use them. You should also make sure that you are running more copies of mirror maker than you need to handle your peak traffic. If you fall behind, such as if you have a network partition for a period of time, you want to be able to catch up quickly. If you don’t have excess capacity, it will take a long time, or you will just continue to fall behind. You should also run multiple consumer and producer streams in each copy of mirror maker, as this will allow you to take advantage of the parallel nature of having multiple partitions. If you can process 15 megabytes per second at peak on one stream, you won’t get 30 with two streams, but you’ll do a lot better. We run with 8 consumer streams and 4 producer streams, and it works out pretty well. We also co-locate up to 11 mirror makers on one host, each for a separate pipeline.
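To see why excess capacity matters, consider how long it takes to drain a backlog. This is a minimal sketch in Python; the function name and the traffic numbers are invented for the example.

```python
def catch_up_seconds(backlog_mb, capacity_mb_per_s, inbound_mb_per_s):
    """Time to drain a backlog while new messages keep arriving."""
    headroom = capacity_mb_per_s - inbound_mb_per_s
    if headroom <= 0:
        # No spare capacity: the backlog never shrinks.
        return float("inf")
    return backlog_mb / headroom

# A 30 minute outage at 100 MB/s of inbound traffic leaves 180,000 MB behind.
backlog = 30 * 60 * 100
with_headroom = catch_up_seconds(backlog, capacity_mb_per_s=150, inbound_mb_per_s=100)
without_headroom = catch_up_seconds(backlog, capacity_mb_per_s=100, inbound_mb_per_s=100)
```

With 50% spare capacity the pipeline needs an hour to recover from a half hour outage; with none, it never recovers.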
There are a few other parameters to consider. One is the partition assignment strategy. We asked our developers to add a round robin strategy of balancing partitions for wildcard consumers, like mirror maker. This provides a nice balance of partitions across your mirror makers, and should almost certainly be the configuration you use. You should also set the number of in flight requests per connection. A higher number will make things go faster, but it will also mean more loss of messages if mirror maker breaks. The linger time for the producer is another thing to look at. A longer linger time will allow mirror maker to assemble more efficient batches of messages, but it will also mean that messages take a little longer to get through the pipeline.
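As an illustration of the settings just mentioned, here is a small Python sketch that renders them as properties-file text. The property names are the ones I believe the Kafka consumer and producer use for these settings; the values are placeholders rather than recommendations, and both names and accepted values should be checked against the documentation for the Kafka version in use.

```python
# Assumed consumer setting: round robin assignment for wildcard consumers.
consumer_settings = {
    "partition.assignment.strategy": "roundrobin",
}

# Assumed producer settings. The values are placeholders for illustration.
producer_settings = {
    "acks": "-1",                                  # wait for all in-sync replicas
    "max.in.flight.requests.per.connection": "1",  # fewer batches at risk
    "linger.ms": "100",                            # larger batches, more delay
}

def to_properties(settings):
    """Render a dict as the text of a Java-style properties file."""
    return "\n".join(f"{key}={value}" for key, value in sorted(settings.items()))

consumer_text = to_properties(consumer_settings)
producer_text = to_properties(producer_settings)
```

Raising the in-flight count or lowering the linger time trades safety or batching efficiency for speed.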
Weigh which tradeoffs are the right ones for you.
Another thing you can do is to provide separate paths for different topics between the same two clusters. We do this at LinkedIn because not all topics are created equal.
We have high priority topics. For you, this could be topics that change search results. For us, it’s mostly topics that are used for hourly or daily reporting. Most other topics, especially headed to Hadoop, we’re OK if they’re a little delayed. But if the hourly report to the executives is delayed, you can be sure that I’m fielding phone calls as to exactly what is broken and when it will be fixed.
For these topics, we have two separate mirror maker pipelines that run in parallel. The high priority mirror maker has a small whitelist of topics, and the other mirror maker has a blacklist that contains the same topic list. This way a bloated topic that is not considered a priority will not delay the most important topics. It also means that the priority mirror maker starts up faster, and takes less time to catch up when there is a problem. These are all very good things if it means the CEO doesn’t know my name.
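The key property of this setup is that the whitelist and the blacklist are the same pattern, so every topic lands in exactly one pipeline. Here is a minimal Python sketch of that routing; the topic names and the function name are hypothetical.

```python
import re

# Hypothetical priority topics. The same pattern serves as the whitelist of
# the priority pipeline and the blacklist of the bulk pipeline.
PRIORITY_TOPICS = "hourly-report-events|daily-report-events"

def pipeline_for(topic, priority_pattern=PRIORITY_TOPICS):
    """Return which of the two mirror maker pipelines carries this topic."""
    if re.fullmatch(priority_pattern, topic):
        return "priority"
    return "bulk"

routes = {topic: pipeline_for(topic)
          for topic in ["hourly-report-events", "page-views", "daily-report-events"]}
```

If the two lists ever drift apart, a topic is either mirrored twice or not at all, so it helps to generate both from one source.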
As the people running the Kafka infrastructure, we take on more responsibility by moving the complexity to our environment. This means that we need to be vigilant to make sure that the promises we make to our customers are kept. So we monitor the infrastructure to make sure everything is running properly, but we also need to make sure that when it is running, it is doing the right thing: namely, moving all the messages.
Many of us use Kafka for monitoring applications. At LinkedIn, every application and server writes metrics and logs to Kafka. We have central applications that read out these metrics and provide pretty graphs, thresholding, and other tools to make sure that everything is running properly within LinkedIn. Kafka itself is no exception, which leads to this…
As soon as I say “monitoring Kafka with Kafka”, we know this is not a good thing
This means that we need to have a way of monitoring Kafka that does not rely on Kafka itself, at least for the critical metrics that tell us whether or not Kafka is working. For metrics that we look at over a longer term, such as growth metrics, it’s OK to funnel those through Kafka into the same system. But if your Kafka cluster for metrics dies, you will hear nothing but silence from your alerting system. We’ve written a monitoring system in our environment that watches the key metrics in Kafka and provides a completely separate path for thresholding and notifications.
When it comes to things specific to tiered architectures, what you need to monitor is the health of the mirror maker application. You want a healthcheck to know that mirror maker is running, and you also want to monitor the consumer lag to make sure it is not falling behind.
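Consumer lag is the gap between the newest offset in each partition and the offset the consumer has committed. Here is a minimal Python sketch of a lag check; the function names are made up, and the offsets are passed in as plain dictionaries, so fetching them from Kafka is left out.

```python
def total_lag(log_end_offsets, committed_offsets):
    """Sum over partitions of newest offset minus committed offset."""
    return sum(end - committed_offsets[partition]
               for partition, end in log_end_offsets.items())

def lag_is_growing(samples):
    """True when every lag sample is larger than the one before it."""
    return len(samples) > 1 and all(b > a for a, b in zip(samples, samples[1:]))

lag = total_lag({0: 100, 1: 250}, {0: 90, 1: 250})
falling_behind = lag_is_growing([10, 40, 90])
recovering = lag_is_growing([90, 40, 10])
```

Alerting on the trend rather than on a single value avoids paging someone for a brief spike that mirror maker is already working through.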
The bigger question, which basic monitoring cannot answer, is whether or not the data in your tiers is intact. Does your aggregate tier contain all of the messages it is supposed to? For this, we need a more detailed audit of the messages that are produced.
In LinkedIn’s environment, the producers of messages use an internal library called TrackerProducer. This library takes care of proper Avro encoding of messages, interfacing with a separate schema-registry for schema lookups. This library also starts the trail of audit information for messages. Every 10 minutes it produces a message into a special audit topic on the Kafka cluster with a count of how many messages were produced in the last 10 minutes.
Additionally, the Kafka cluster has an audit consumer which reads all messages out of the cluster and publishes back audit topic messages with counts of how many messages were produced into each topic for each 10 minute period. Combining this with the producer audit, we can be assured that all messages that were attempted to be produced actually made it into Kafka. If the counts do not match, we know that there is a problem with one or more producers.
Moving down to the aggregate cluster, there is another audit consumer instance which allows us to now compare the number of messages in the aggregate cluster to the number in the local cluster, and the number that were produced. We also introduce the concept of an auditing consumer here, which writes audit messages about how many messages it consumed for each 10 minute period. This completes an end-to-end accounting, from producer to consumer, of every message. A special consumer reads all audit messages out of the Kafka clusters and writes them to a database for performing comparisons.
In every message schema, we have a common header which provides, in part, the information needed to generate audit information. This includes a timestamp, set by the producer, specifying when the message was sent, as well as the application name and the hostname that the message originated at. This is utilized by the audit consumers when they read messages in order to count up messages by enough criteria to pinpoint a problem.
The special audit messages themselves have start and end timestamps, which describe the 10 minute bucket they cover. They also have the topic name for which they apply, as well as a tier. The tier is a string which describes where this audit information came from. If it was sent in by a producer, the tier is always “producer”. This allows us to have a single tier that covers the production of all messages, since we have an environment where different services can produce the same type of message. The audit consumers use tier names that are specific to the Kafka cluster they are reading from, and consumers can specify their own tier name. Finally, there is a count of the number of messages in this bucket.
Of course, the audit messages also have the common message header, so we can audit the audit if needed.
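The counting scheme described above can be sketched in a few lines. This is an illustrative Python model, not LinkedIn's audit code: the record fields mirror the ones listed (start, end, topic, tier, count), while the function names, the use of integer seconds for timestamps, and the sample data are assumptions.

```python
from collections import defaultdict

BUCKET_SECONDS = 600  # 10 minute audit buckets

def bucket_start(timestamp):
    return timestamp - (timestamp % BUCKET_SECONDS)

def count_messages(messages, tier):
    """messages: iterable of (topic, header_timestamp) pairs.
    Returns one audit record per topic and bucket."""
    counts = defaultdict(int)
    for topic, timestamp in messages:
        counts[(topic, bucket_start(timestamp))] += 1
    return [{"topic": topic, "start": start, "end": start + BUCKET_SECONDS,
             "tier": tier, "count": count}
            for (topic, start), count in sorted(counts.items())]

def find_discrepancies(audit_records, reference_tier="producer"):
    """Compare every tier's count against the producer tier, per bucket."""
    totals = defaultdict(lambda: defaultdict(int))
    for record in audit_records:
        totals[(record["topic"], record["start"])][record["tier"]] += record["count"]
    problems = []
    for (topic, start), by_tier in sorted(totals.items()):
        expected = by_tier.get(reference_tier, 0)
        for tier, count in sorted(by_tier.items()):
            if count != expected:
                problems.append((topic, start, tier, count - expected))
    return problems

produced = [("page-views", 5), ("page-views", 610), ("page-views", 700)]
records = (count_messages(produced, "producer")
           + count_messages(produced[:2], "tracking-local"))
problems = find_discrepancies(records)
```

Because the bucket is derived from the timestamp in the message header rather than from the time of reading, every tier assigns a message to the same bucket no matter how late it arrives.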
We have a few concerns with our audit system. One that comes up fairly frequently is that we are only counting messages, we’re not considering the content of the message. This means that if we duplicate one message, and lose a different message, we still think we don’t have a problem. The reality of the situation is that this really doesn’t happen, at least not with any exactness, at the number of messages we are passing. Additionally, if we wanted to we could use the data we are storing to audit messages for a particular service, or a particular server, and get much more granular. Lastly, one of the largest consumers of our audited messages, Hadoop, performs additional checks on the messages to trim out duplicates and check the message content using other fields in the header.
Another one that has come up recently is that we do not audit all consumers. Most consumers will just monitor their lag to make sure they are not falling behind, but they do not check to make sure they read every message that was produced. Hadoop is the exception because of the importance and variety of work that is done there – it uses an auditing consumer that writes back audit messages so we consider it another tier in audit. We found that the way our relational database is set up currently would, most likely, not be able to handle the amount of activity it would get if every consumer started using the auditing consumer. We’ve been working on changes to the data backend to support this.
We also cannot properly audit complex message flows. For a given topic, each tier must have 100% of the messages in it. This means that all of our local tracking clusters have the same tier name, tracking-local, whereas each aggregate cluster has a site specific tier name that differentiates it from the other aggregate clusters. If there’s a problem in the local tier, we don’t immediately know which datacenter the problem is in without further investigation. We also have problems with topics that take different paths to get to aggregate, which can come in when you have special clusters for outside clients. What we’d like to do here is to have an audit infrastructure that has knowledge of the mirror maker layout, and the whitelists and blacklists that each mirror maker is configured for, so that we can more easily determine exactly where a problem is when it occurs. This is a longer term project that we’re only starting to plan out right now.
Specifically to address concerns with running multiple tiers, there are several things we are looking for improvements in. One of the biggest is access controls. Right now, we have no way to prevent clients from producing to the aggregate clusters, and this can generate big problems with how we access and audit data. In a more general sense, we also need the ACLs to make sure we know who is producing to what topics, and to secure any topics that should be limited access. This could include topics that have details like credit cards or health information.
We also need to have encryption. Starting out, encryption of the data in motion is the most critical part. We have the luxury of working entirely within our own networks and backbones, but even at that we cannot be assured of the connections between our datacenters. We are moving towards first encrypting these communications, and then making sure every client connection is encrypted. Thankfully, both this and the ACLs are currently being worked on by the open source developers through a series of proposals and tickets that will address authentication, authorization, and TLS encryption. Later on, we will need to consider encryption of data at rest. This, however, can be handled entirely in the clients where it is needed.
Another piece that we need is quotas. We have no way right now to prevent one bad actor from performing a denial of service, even unintentionally, against a cluster. This is a particular concern for us when we have Hadoop jobs either consuming or producing, as they can spin up many many clients to do their work. Right now, we mitigate this by having separate clusters for Hadoop to work in, but we want to collapse this as much as possible to avoid duplication of messages. It also creates yet another set of things that can fail. Using quotas, we can set limits on how much damage a client can do, and assure that the right application gets penalized if they cause a problem, and not everyone else. This is also being worked on in an open source proposal from LinkedIn.
Lastly, we need improvements with the way mirror maker does decompression and recompression of message batches. This is pretty obvious – we want to avoid overhead work wherever possible. As of yet, we don’t have a good proposed solution for how to handle it that doesn’t involve trickery with the existing protocol definition. There have been some recent improvements with decompression in Kafka, both driven by us and by other developers, but more work is needed. It’s something we’re talking a lot about internally.
So how can you get more involved in the Kafka community?
The most obvious answer is to go to kafka.apache.org. From there you can
Join the mailing lists, either on the development or the user side
You’ll find people on the #apache-kafka channel on Freenode IRC if you have questions
We also coordinate meetups for both Kafka and Samza in the Bay Area, with streaming if you are not local
You can also dive into the source repository, and work on and contribute your own tools back.
Kafka may be young, but it’s a critical piece of data infrastructure for many of us.