Multitenancy: Kafka clusters
for everyone at LINE
Yuto Kawamura - LINE Corporation
Speaker introduction
— Yuto Kawamura
— Senior Software Engineer at
LINE
— Leading project to redesign
microservices architecture
w/ Kafka
— Apache Kafka Contributor
— Speaker at Kafka Summit SF
2017
— Also at Kafka Meetup #3
Outline
— Kafka at LINE as of today (2018.04)
— Challenges on multitenancy
— Engineering for achieving multitenancy
Kafka at LINE as of today
(2018.04)
We have more clusters
— Added more clusters since last
year to support:
— Different DCs
— Security sensitive data w/
SASL+TLS
— They are separated by "purposes" but not by "users"; our multitenancy strategy
— Fewer clusters allow us to concentrate our engineering resources on maximizing their performance
— They're conceptually the "Data Hub" too
One cluster has many users
— Topics:
— 100 ~ 400+ per cluster
— Users:
— few ~ tens per cluster
— Messages: 150 billion messages / day in largest cluster
— 3+ million / sec at peak
— No messages are supposed to be lost, because every usage is somehow tied to a service
Challenges on multitenancy
To do multitenancy, we have to ensure:
— A certain level of isolation among client workloads
— The cluster is proof against abusive clients
— We can track which client sent a particular request
— We have to be confident enough in what we do to say "don't worry" to people saying "we want a dedicated cluster only for us!"
Engineering for achieving
multitenancy
Request Quota
— It's more important to manage the number of requests than the incoming/outgoing byte rate
— Kafka is amazingly strong at handling large data if it is well-batched
— => For consumers, responses are naturally batched
— => The main danger is producers configured with linger.ms=0
— Starting from 0.11.0.0, KIP-124 lets us configure a request rate quota 2
2
https://cwiki.apache.org/confluence/display/KAFKA/KIP-124+-+Request+rate+quotas
Request Quota
— Manage the master copy of cluster config in YAML inside an Ansible repository
— Apply it all at once during cluster provisioning with the kafka_config Ansible module (developed internally)
— Can tell the latest config on a cluster w/o querying it, and can keep change history in git
---
kafka_cluster_configs:
- entity_type: clients
configs:
request_percentage: 40
producer_byte_rate: 1073741824
- entity_type: clients
entity_name: foobar-producer
configs:
request_percentage: 200
Slowlog
— Log requests which take longer than a certain threshold to process
— Kafka has built-in "request logging", but it produces too many lines
— Inspired by HBase's slow query log
# RequestChannel.scala#updateRequestMetrics
+ slowLogThresholdMap.get(metricNames.head).filter(_ >= 0).filter { v =>
+ val targetTime = requestId match {
+ case ApiKeys.FETCH.id => totalTime - apiRemoteTime
+ case _ => totalTime
+ }
+
+ targetTime >= v
+ }.foreach { _ =>
+ requestLogger.warn("Slow response:%s from connection %s;totalTime:%d...
+ .format(requestDesc(true), connectionId, totalTime, requestQueueTime...
+ }
[2016-12-26 16:04:20,135] WARN Slow response:Name: FetchRequest;
Version: 2 ... ;totalTime:1817;localTime: ...
Slowlog
— Thresholds can be changed dynamically through JMX console for each request type
The disk read by delayed consumer problem
— Detection: 50x ~ 100x slowdown in 99th percentile Produce response time
— A certain amount of disk reads
— Network threads' utilization was very high
Suspecting sendfile is taking long...
— Because: 1. disk reads were occurring at that time, 2. network threads' utilization was high
$ stap -e '(script counting sendfile(2) duration histogram)'
value |---------------------------------------- count
0 | 0
1 | 71
2 |@@@ 6171
16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 29472
32 |@@@ 3418
2048 | 0
...
8192 | 3
— Normal: 2 ~ 32us
— Outliers: 8ms ~
— (About SystemTap, see my previous presentation 3
)
3
https://www.slideshare.net/kawamuray/kafka-meetup-jp-3-engineering-apache-kafka-at-line
Kafka broker's thread model
— Network threads (controlled by num.network.threads) take care of reading/writing requests/responses
— Network threads hold established connections exclusively for event-driven IO
— Request handler threads (controlled by num.io.threads) take care of request processing and IO against the block device, except sendfile(2) for Fetch requests
When a Fetch request arrives for data that isn't present in the page cache...
Problem definition
— Network threads contain potentially-blocking ops while they're supposed to work as event loops
— and we have no way to know whether an upcoming sendfile(2) will block awaiting a disk read
It was one of the worst issues we've had, because it:
— Completely breaks resource isolation among all clients, including producers
— Occurs naturally whenever one of the consumers slows down
— Forces us to communicate with users every time to ask for a fix
— Occurs 100% of the time when a broker restores log data from the leader
Solution candidates
— A: Separate network threads among clients
— => Possible, but a lot of changes required
— => Not essential, because network threads should be purely computation intensive
— B: Balance connections among network threads
— => Possible, but again a lot of changes
— => Still, at first the other connections on the same thread get affected
— C: Make sure that data is ready in memory before the response is passed to the network thread
To guarantee non-blocking sendfile(2) in network threads...
— The target data must be available in the page cache
How?
NAME
sendfile - transfer data between file descriptors
SYNOPSIS
#include <sys/sendfile.h>
ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);
— sendfile(2) on Linux doesn't accept flags for controlling its behavior
— Interestingly, FreeBSD has one, contributed by nginx and Netflix 1
1
https://www.nginx.com/blog/nginx-and-netflix-contribute-new-sendfile2-to-freebsd/
So we have to:
1. Pre-read data that isn't in the page cache from disk,
2. and confirm the pages' presence before passing the response to network threads
sendfile(2) to the dest /dev/null
— Calling channel.transferTo("/dev/null") (== sendfile(2) to /dev/null) in a request handler thread might populate the page cache?
— Tested it out, and found there's no noticeable performance impact
How could it be that harmless?
— The Linux kernel internally uses splice to implement sendfile(2)
— splice delegates the work to the target file's struct file_operations
— /dev/null's struct file_operations (null_fops) just iterates over the list of page pointers, not over each byte
— => Iteration count is SIZE / PAGE_SIZE (4k)
# ./drivers/char/mem.c
static int pipe_to_null(struct pipe_inode_info *info, struct pipe_buffer *buf,
struct splice_desc *sd)
{
return sd->len;
}
static ssize_t splice_write_null(struct pipe_inode_info *pipe,struct file *out,
loff_t *ppos, size_t len, unsigned int flags)
{
return splice_from_pipe(pipe, out, ppos, len, flags, pipe_to_null);
}
Patching broker to call sendfile(/dev/null) in request
handler threads
# FileRecords.java
@SuppressWarnings("UnnecessaryFullyQualifiedName")
private static final java.nio.file.Path DEVNULL_PATH = new File("/dev/null").toPath();
public void prepareForRead() throws IOException {
long size = Math.min(channel.size(), end) - start;
try (FileChannel devnullChannel = FileChannel.open(DEVNULL_PATH,
java.nio.file.StandardOpenOption.WRITE)) {
channel.transferTo(start, size, devnullChannel);
}
}
— Still not fully portable because it relies on the underlying kernel's implementation details (so we haven't contributed it upstream...)
... and more to minimize impact of increased syscall...
# Log.scala#read
@@ -585,6 +586,17 @@ class Log(@volatile var dir: File,
if(fetchInfo == null) {
entry = segments.higherEntry(entry.getKey)
} else {
+ // For last entries we assume that it is hot enough to still have all data in page cache.
+ // Most of fetch requests are fetching from the tail of the log, so this optimization
+ // should save call of readahead() + mmap() + mincore() * N significantly.
+ if (!isLastEntry && fetchInfo.records.isInstanceOf[FileRecords]) {
+ try {
+ info("Prepare Read for " + fetchInfo.records.asInstanceOf[FileRecords].file().getPath)
+ fetchInfo.records.asInstanceOf[FileRecords].prepareForRead()
+ } catch {
+ case e: Throwable => warn("failed to prepare cache for read", e)
+ }
+ }
return fetchInfo
}
— Perform cache warmup only if the segment being read IS NOT the latest
— => saves unnecessary syscalls for 99% of Fetch requests
After all
Conclusion
— Having fewer clusters enables us to concentrate on reliability engineering and essential troubleshooting/fixes
— Preventive engineering enables us to keep operating Kafka clusters with the highest reliability, even under high and inexplicable load
— We've had some failures in the development cluster, but never in the production cluster
— What matters in operating on-premise multitenancy: it's not necessary to prevent 100% of failures, but never let the same hole be punched twice
End of presentation.
Questions?
