This document summarizes a presentation about Kafka multi-tenancy at LINE Corporation. The presentation discusses how LINE runs a single shared Kafka cluster that handles 160 billion messages per day from many independent services. It describes the broker hardware, the requirements for multi-tenancy such as protecting the cluster against abusive workloads and providing isolation among clients, and specific issues identified, such as slow response times caused by blocking disk reads in the network threads. It then covers the solutions implemented: request quotas, per-client metrics, a slowlog, and pre-loading data into the page cache to avoid blocking. The presentation concludes that, after addressing these issues, the shared Kafka hosting model works efficiently while maintaining a single data hub.
Kafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINE
1. Kafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINE
Yuto Kawamura - LINE Corporation
2. Speaker introduction
Yuto Kawamura
Senior Software Engineer at LINE
Leading a team that provides a company-wide Kafka platform
Apache Kafka Contributor
Speaker at Kafka Summit SF 2017 [1]
[1] https://kafka-summit.org/sessions/single-data-hub-services-feed-100-billion-messages-per-day/
3. LINE
Messaging service
164 million active users in countries with top market share like Japan, Taiwan, Thailand and Indonesia [2]
And many other services:
- News
- Bitbox/Bitmax - Cryptocurrency trading
- LINE Pay - Digital payment
[2] As of June 2018.
4. Kafka platform at LINE
Two main usages:
— "Data Hub" for distributing data to other services
— e.g., user relationship update events from the messaging service
— As a task queue for buffering and processing business logic asynchronously
5. Kafka platform at LINE
A single cluster is shared by many independent services for:
- the concept of a Data Hub
- efficiency of management/operation
Messaging, AD, News, Blockchain and more: all of their data is stored and distributed on a single Kafka cluster.
6. From department-wide to company-wide platform
It started just for the messaging service. Now everyone uses it.
7. Broker installation
CPU: Intel(R) Xeon(R) 2.20GHz x 20 cores (HT) * 2
Memory: 256GiB
- more memory, more caching (page cache)
- newly written data can survive in page cache only ~20 minutes ...
Network: 10Gbps
Disk: HDD x 12, RAID 1+0
- saves maintenance costs
Kafka version: 0.10.2.1 ~ 0.11.1.2
8. Requirements for doing multi-tenancy
The cluster can protect itself against abusive workloads
- An accidental workload doesn't propagate to other users.
We can track which client is sending requests
- Find the source of strange requests.
Certain level of isolation among client workloads
- A slow response for one client doesn't appear to another client.
9. Protect cluster against abusive workload - Request Quota
It is more important to manage the number of requests than the incoming/outgoing byte rate.
Kafka is amazingly durable under large data volumes as long as they are well-batched.
=> Producers configured with linger.ms=0 on a large number of servers likely produce a large number of requests
Starting from 0.11.0.0, KIP-124 lets us configure a request rate quota [3] (a configuration sketch follows below)
[3] https://cwiki.apache.org/confluence/display/KAFKA/KIP-124+-+Request+rate+quotas
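
As a sketch of how such a quota could be applied on a broker of this era (ZooKeeper-managed configs; the host name and the 200% value are illustrative assumptions, not values from the talk):

# Hypothetical example: apply a default request-time quota of 200%
# (i.e. at most two full network/IO threads' worth of time) to every
# client-id, as a least protection against a single abusive client.
bin/kafka-configs.sh --zookeeper zk1:2181 --alter \
  --add-config 'request_percentage=200' \
  --entity-type clients --entity-default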
10. Protect cluster against abusive workload - Request Quota
The basic idea is to apply a default quota as a least protection, preventing a single abusive client from destabilizing the cluster.
*Not for controlling the resource quantity of each client.
11. Track requests from clients - Metrics
— kafka.server:type=Request,user=([-.\w]+),client-id=([-.\w]+):request-time
— Percentage of time spent in broker network and I/O threads to process requests from each client group.
— Useful to see how much of the broker's resources is being consumed by each client (a polling sketch follows below).
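
A minimal sketch of polling these per-client metrics over JMX (the broker host, JMX port and the standalone object are assumptions for illustration, not from the talk):

import javax.management.ObjectName
import javax.management.remote.{JMXConnectorFactory, JMXServiceURL}

object RequestTimeDump {
  def main(args: Array[String]): Unit = {
    // Hypothetical broker JMX endpoint; adjust host/port to your deployment.
    val url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi")
    val connector = JMXConnectorFactory.connect(url)
    try {
      val mbsc = connector.getMBeanServerConnection
      // Match the per-user/per-client-id Request metrics named above.
      val pattern = new ObjectName("kafka.server:type=Request,user=*,client-id=*")
      val names = mbsc.queryNames(pattern, null).iterator()
      while (names.hasNext) {
        val name = names.next()
        // "request-time" is the percentage of network + I/O thread time.
        println(s"$name request-time=" + mbsc.getAttribute(name, "request-time"))
      }
    } finally connector.close()
  }
}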
12. Track requests from clients - Slowlog
Log requests which took longer than a certain threshold to process.
- Kafka has "request logging" but it produces too many lines
- Inspired by HBase's slow query log
Thresholds can be changed dynamically through the JMX console for each request type (a sketch follows below).
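
A minimal sketch of the idea, with simplified stand-ins for request handling (the names and println logging are hypothetical, not LINE's actual patch):

// Hypothetical slowlog wrapper. The threshold is volatile so that a JMX
// MBean exposing it can change the value at runtime, per request type.
object Slowlog {
  @volatile var thresholdMs: Long = 500L

  def timed[A](requestName: String)(handle: => A): A = {
    val start = System.nanoTime()
    try handle
    finally {
      val tookMs = (System.nanoTime() - start) / 1000000
      if (tookMs >= thresholdMs)
        println(s"[slowlog] $requestName took ${tookMs}ms (threshold=${thresholdMs}ms)")
    }
  }
}

// Usage: Slowlog.timed("Fetch") { /* process the request */ }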
21. Network thread runs an event loop
— Multiplexes and processes assigned client sockets sequentially.
— It never blocks awaiting IO completion.
=> So it makes sense to set num.network.threads <= CPU_CORES (see the configuration sketch below)
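
As a configuration sketch (the value is illustrative, assuming a machine like the 40-hardware-thread brokers described earlier):

# server.properties: never more network threads than cores, since each
# thread runs a non-blocking event loop and should stay CPU-bound.
num.network.threads=16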
22. When network threads get busy...
It means one of two things:
1. Really busy doing lots of work: many requests/responses to read/write
2. Blocked by some operation (which, in general, should never happen in an event loop)
23. Response handling of normal requests
When a response is in the queue, all data to be transferred is in memory.
24. Exceptional handling for Fetch response
When the response is in the queue, topic data segments are not in userspace memory.
=> Copied to the client socket directly inside the kernel using the sendfile(2) system call.
25. What if target data doesn't exist in page cache?
Target data in page cache:
=> Just a memory copy. Very fast: ~ 100us
Target data is NOT in page cache:
=> Needs to load data from disk into page cache first. Can be slow: ~ 50ms (or even slower)
26. Suspecting blocking in sendfile(2)
Inspected the duration of sendfile system calls issued by the broker process using SystemTap (a dynamic tracing tool that probes events in the kernel; see my previous talk [4]). A sketch of such a script follows below.
$ stap -e '(script counting sendfile(2) duration histogram)'
# value (us)
value |---------------------------------------- count
    0 |                                             0
    1 |                                            71
    2 |@@@                                       6171
   16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  29472
   32 |@@@                                       3418
 2048 |                                             0
  ...
 8192 |                                             3
[4] https://www.confluent.io/kafka-summit-sf17/One-Day-One-Data-Hub-100-Billion-Messages-Kafka-at-LINE
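
The actual script is elided on the slide; a minimal SystemTap sketch of this kind of measurement might look as follows (the probe names and target-PID handling are generic assumptions, not the script used in the talk):

# sendfile_hist.stp -- run as: stap -x <broker_pid> sendfile_hist.stp
# Records the latency of each sendfile(2) call made by the target
# process and prints a log2 histogram on exit.
global start, durations

probe syscall.sendfile {
  if (pid() == target())
    start[tid()] = gettimeofday_us()
}

probe syscall.sendfile.return {
  if (pid() == target() && start[tid()]) {
    durations <<< gettimeofday_us() - start[tid()]
    delete start[tid()]
  }
}

probe end {
  print(@hist_log(durations))
}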
27. Problem hypothesis
A Fetch request reading old data causes a blocking sendfile(2) in the event loop, adding latency to all responses that must be processed by the same network thread.
28. Problem hypothesis
Super harmful because:
It can be triggered either by:
- Consumers attempting to fetch old data
- Replica fetch by follower brokers restoring replicas of old logs
=> Both are very common scenarios
It breaks performance isolation among independent clients.
29. Solution candidates
A: Separate network threads among clients
=> Possible, but a lot of changes required
=> Not essential, because network threads should be completely computation intensive
B: Balance connections among network threads
=> Possible, but again a lot of changes
=> Still, at the first moment other connections would get affected
30. Solution candidates
C: Make sure the data is ready in memory before the response is passed to the network thread
=> The event loop never blocks
31. Choice: Warm up page cache before the network thread
Move the blocking part to the request handler threads (= a single queue and a pool of threads)
=> A free thread can take an arbitrary task (request) while some threads are blocked.
32. Choice: Warm up page cache before the network thread
When the network thread calls sendfile(2) to transfer log data, the data is always in page cache.
33. Warming up page cache with minimal overhead
Easiest way: do a synchronous read(2) on the target data
=> Large overhead from copying memory from kernel to userland
Why does Kafka use sendfile(2) for transferring topic data?
=> To avoid an expensive large memory copy
How can we achieve warmup while keeping this property?
34. Trick #1: Zero-copy synchronous page load
Call sendfile(2) for the target data with /dev/null as the destination.
The /dev/null driver does not actually copy data anywhere (a sketch follows below).
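
The patch on slide 38 calls FileRecords.prepareForRead(); a minimal sketch of what such a warmup might do (the signature and body are assumptions based on the trick described, not the exact LINE patch):

// Hypothetical sketch: warm the page cache by sendfile(2)-ing the
// segment's bytes into /dev/null, which loads the pages from disk
// without copying anything into userspace.
import java.io.RandomAccessFile
import java.nio.channels.FileChannel

def prepareForRead(channel: FileChannel, start: Long, end: Long): Unit = {
  val size = math.min(channel.size, end) - start
  val devNull = new RandomAccessFile("/dev/null", "rw")
  try {
    // FileChannel.transferTo is implemented with sendfile(2) on Linux,
    // so this triggers the zero-copy page-cache load described above.
    channel.transferTo(start, size, devNull.getChannel)
  } finally {
    devNull.close()
  }
}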
35. Why does it have almost no overhead?
The Linux kernel internally uses splice to implement sendfile(2).
The splice implementation of /dev/null returns without iterating over the target data.
# ./drivers/char/mem.c
static const struct file_operations null_fops = {
    ...
    .splice_write = splice_write_null,
};

static int pipe_to_null(...)
{
    return sd->len;
}

static ssize_t splice_write_null(...)
{
    return splice_from_pipe(pipe, out, ppos, len, flags, pipe_to_null);
}
37. Trick #2: Skip the "hot" last log segment
Another concern: additional syscalls x Fetch request count?
- Warming up is necessary only for older data.
- Exclude the last log segment from the warmup target.
38. Trick #2: Skip the "hot" last log segment
# Log.scala#read
@@ -585,6 +586,17 @@ class Log(@volatile var dir: File,
       if(fetchInfo == null) {
         entry = segments.higherEntry(entry.getKey)
       } else {
+        // For last entries we assume that it is hot enough to still have all data in page cache.
+        // Most of fetch requests are fetching from the tail of the log, so this optimization
+        // should save call of sendfile significantly.
+        if (!isLastEntry && fetchInfo.records.isInstanceOf[FileRecords]) {
+          try {
+            info("Prepare Read for " + fetchInfo.records.asInstanceOf[FileRecords].file().getPath)
+            fetchInfo.records.asInstanceOf[FileRecords].prepareForRead()
+          } catch {
+            case e: Throwable => warn("failed to prepare cache for read", e)
+          }
+        }
         return fetchInfo
       }
39. It works
No response time degradation in unrelated requests, even when coinciding Fetch requests trigger disk reads.
40. Patch upstream?
Concern: the patch heavily assumes the underlying kernel implementation.
Still:
- The effect is tremendous.
- It fixes a very common performance degradation scenario.
Discussion at KAFKA-7504
41. Conclusion
— Discussed requirements for multi-tenancy clusters and the solutions
— Quota, Metrics, Slowlog ... and a hacky patch.
— After fixing some issues, our hosting policy is working well and efficiently, keeping:
— the concept of a single "Data Hub" and
— operational cost not proportional to the number of users/usages.
— Kafka is well designed and implemented to host many independent, different workloads.