
Multitenancy: Kafka clusters for everyone at LINE

Kafka Meetup 2018.04 https://kafka-apache-jp.connpass.com/event/80479/


  1. 1. Multitenancy: Kafka clusters for everyone at LINE Yuto Kawamura - LINE Corporation
  2. 2. Speaker introduction — Yuto Kawamura — Senior Software Engineer at LINE — Leading project to redesign microservices architecture w/ Kafka — Apache Kafka Contributor — Speaker at Kafka Summit SF 2017 — Also at Kafka Meetup #3
  3. 3. Outline — Kafka at LINE as of today (2018.04) — Challenges on multitenancy — Engineering for achieving multitenancy
  4. 4. Kafka at LINE as of today (2018.04)
  5. 5. We have more clusters — Added more clusters since last year to support: — Different DCs — Security-sensitive data w/ SASL+TLS — They are separated by "purpose" but not by "user"; that is our multitenancy strategy — Fewer clusters let us concentrate our engineering resources on maximizing their performance — They're conceptually the "Data Hub" too
  6. 6. One cluster has many users — Topics: — 100 ~ 400+ per cluster — Users: — a few ~ tens per cluster — Messages: 150 billion messages / day in the largest cluster — 3+ million / sec at peak — None of the messages is supposed to be lost, because every usage is in some way tied to a service
  7. 7. Challenges on multitenancy
  8. 8. For doing multitenancy, we have to ensure: — A certain level of isolation among client workloads — The cluster is proof against abusive clients — We can track which client is sending a particular request — We have to be confident in what we do, to be able to say "don't worry" to people saying "we want a dedicated cluster only for us!"
  9. 9. Engineering for achieving multitenancy
  10. 10. Request Quota — It's more important to manage the number of requests than the incoming/outgoing byte rate — Kafka is amazingly strong at handling large data as long as it is well batched — => For consumers, responses are naturally batched — => The main danger is producers configured with linger.ms=0 — Starting from 0.11.0.0, KIP-124 lets us configure a request rate quota 2 — 2: https://cwiki.apache.org/confluence/display/KAFKA/KIP-124+-+Request+rate+quotas
  11. 11. Request Quota — Manage the master copy of cluster config in YAML inside the Ansible repository — Apply it all at once during cluster provisioning with the kafka_config Ansible module (developed internally) — Can tell the latest config on the cluster w/o querying the cluster, and can keep change history in git
     ---
     kafka_cluster_configs:
       - entity_type: clients
         configs:
           request_percentage: 40
           producer_byte_rate: 1073741824
       - entity_type: clients
         entity_name: foobar-producer
         configs:
           request_percentage: 200
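The talk-era way to apply these quotas was kafka-configs.sh (or, as above, the internal Ansible module). As a minimal illustrative sketch only, assuming a broker reachable at localhost:9092: newer Kafka releases (2.6+, KIP-546) can also set the same request_percentage values from Java via the AdminClient, mirroring the YAML above.

     import java.util.Arrays;
     import java.util.Collections;
     import java.util.Properties;

     import org.apache.kafka.clients.admin.Admin;
     import org.apache.kafka.clients.admin.AdminClientConfig;
     import org.apache.kafka.common.quota.ClientQuotaAlteration;
     import org.apache.kafka.common.quota.ClientQuotaEntity;

     public class RequestQuotaSetup {
         public static void main(String[] args) throws Exception {
             Properties props = new Properties();
             props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
             try (Admin admin = Admin.create(props)) {
                 // Default quota applied to every client (matches the first YAML entry above).
                 ClientQuotaEntity allClients = new ClientQuotaEntity(
                         Collections.singletonMap(ClientQuotaEntity.CLIENT_ID, null)); // null = default entity
                 // Higher allowance for the specific client id from the YAML above.
                 ClientQuotaEntity fooProducer = new ClientQuotaEntity(
                         Collections.singletonMap(ClientQuotaEntity.CLIENT_ID, "foobar-producer"));

                 admin.alterClientQuotas(Arrays.asList(
                         new ClientQuotaAlteration(allClients, Collections.singletonList(
                                 new ClientQuotaAlteration.Op("request_percentage", 40.0))),
                         new ClientQuotaAlteration(fooProducer, Collections.singletonList(
                                 new ClientQuotaAlteration.Op("request_percentage", 200.0)))
                 )).all().get();
             }
         }
     }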
  12. 12. Slowlog — Log requests which took longer than a certain threshold to process — Kafka has "request logging" but it produces too many lines — Inspired by HBase's
     # RequestChannel.scala#updateRequestMetrics
     + slowLogThresholdMap.get(metricNames.head).filter(_ >= 0).filter { v =>
     +   val targetTime = requestId match {
     +     case ApiKeys.FETCH.id => totalTime - apiRemoteTime
     +     case _ => totalTime
     +   }
     +
     +   targetTime >= v
     + }.foreach { _ =>
     +   requestLogger.warn("Slow response:%s from connection %s;totalTime:%d...
     +     .format(requestDesc(true), connectionId, totalTime, requestQueueTime...
     + }
     [2016-12-26 16:04:20,135] WARN Slow response:Name: FetchRequest; Version: 2 ... ;totalTime:1817;localTime: ...
  13. 13. Slowlog — Thresholds can be changed dynamically through the JMX console for each request type
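The deck doesn't show the JMX plumbing, but the usual pattern looks roughly like the following sketch (class and attribute names here are hypothetical, not LINE's actual code): the threshold is exposed as a writable MBean attribute, so it can be changed from jconsole or any other JMX client while the broker keeps running.

     // SlowlogThresholdMBean.java -- standard MBean interface; one writable attribute per threshold
     public interface SlowlogThresholdMBean {
         long getThresholdMs();
         void setThresholdMs(long thresholdMs);
     }

     // SlowlogThreshold.java
     import java.lang.management.ManagementFactory;
     import javax.management.ObjectName;

     public class SlowlogThreshold implements SlowlogThresholdMBean {
         private volatile long thresholdMs;

         public SlowlogThreshold(long initialMs) { this.thresholdMs = initialMs; }

         @Override public long getThresholdMs() { return thresholdMs; }
         @Override public void setThresholdMs(long thresholdMs) { this.thresholdMs = thresholdMs; }

         // Consulted on the request path; a negative threshold disables slowlog for this request type.
         public boolean isSlow(long totalTimeMs) { return thresholdMs >= 0 && totalTimeMs >= thresholdMs; }

         public static void main(String[] args) throws Exception {
             SlowlogThreshold fetchThreshold = new SlowlogThreshold(500);
             ManagementFactory.getPlatformMBeanServer().registerMBean(
                     fetchThreshold, new ObjectName("example:type=Slowlog,request=Fetch"));
             Thread.currentThread().join(); // keep the JVM alive so the attribute can be changed via JMX
         }
     }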
  14. 14. The disk read by delayed consumer problem — Detection: 50x ~ 100x slower 99th percentile Produce response time — A certain amount of disk read — Network threads' utilization was very high
  15. 15. Suspecting sendfile is taking long... — Because: 1. disk read was occurring at that time, 2. network threads' utilization was high
     $ stap -e '(script counting sendfile(2) duration histogram)'
     value |---------------------------------------- count
         0 |                                             0
         1 |                                            71
         2 |@@@                                       6171
        16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  29472
        32 |@@@                                       3418
      2048 |                                             0
       ...
      8192 |                                             3
     — Normal: 2 ~ 32 us — Outliers: 8 ms ~ — (About SystemTap, see my previous presentation 3) — 3: https://www.slideshare.net/kawamuray/kafka-meetup-jp-3-engineering-apache-kafka-at-line
  16. 16. Kafka broker's thread model — Network threads (controlled by num.network.threads) take care of reading/writing requests/responses — Network threads hold established connections exclusively, for event-driven IO — Request handler threads (controlled by num.io.threads) take care of request processing and IO against the block device, except sendfile(2) for Fetch requests
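A highly simplified, hypothetical sketch of this split (not Kafka's actual classes): network threads run an event loop over the connections they own, request handler threads consume a shared request queue and hand finished responses back for the owning network thread to write out; that write-out step is where sendfile(2) for Fetch responses happens.

     import java.util.concurrent.ArrayBlockingQueue;
     import java.util.concurrent.BlockingQueue;

     public class BrokerThreadModelSketch {
         static final BlockingQueue<String> requestQueue = new ArrayBlockingQueue<>(500);   // filled by network threads
         static final BlockingQueue<String> responseQueue = new ArrayBlockingQueue<>(500);  // drained by network threads

         public static void main(String[] args) {
             Thread network = new Thread(() -> {
                 while (!Thread.currentThread().isInterrupted()) {
                     // 1. poll sockets, parse complete requests, requestQueue.offer(request)
                     // 2. write back finished responses; for Fetch responses this is where the broker
                     //    calls sendfile(2), so a page-cache miss here stalls the whole event loop
                     String response = responseQueue.poll(); // (a real event loop blocks on selector.select())
                     if (response != null) { /* write to the client socket */ }
                 }
             }, "network-thread-1");

             Thread handler = new Thread(() -> {
                 while (!Thread.currentThread().isInterrupted()) {
                     try {
                         String request = requestQueue.take();        // request processing + disk IO live here
                         responseQueue.offer("response:" + request);  // hand result back to the network thread
                     } catch (InterruptedException e) {
                         Thread.currentThread().interrupt();
                     }
                 }
             }, "request-handler-1");

             network.start();
             handler.start();
         }
     }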
  17. 17. When a Fetch request arrives for data that isn't present in the page cache...
  18. 18. Problem definition — Network threads contain potentially blocking ops while they're supposed to work as event loops — and we have no way of knowing whether an upcoming sendfile(2) will block awaiting a disk read or not
  19. 19. It was one of the worst issues we had, because it: — Completely breaks resource isolation among all clients, including producers — Occurs naturally when one of the consumers slows down — Forces us to communicate with users every time to ask for a fix — Occurs 100% of the time when a broker restores log data from the leader
  20. 20. Solution candidates — A: Separate network threads among clients — => Possible, but a lot of changes required — => Not essential, because network threads should be purely computation intensive — B: Balance connections among network threads — => Possible, but again a lot of changes — => Still, for the first moment other connections would get affected — C: Make sure the data is ready in memory before the response is passed to the network thread
  21. 21. To make sure sendfile(2) doesn't block in network threads... — The target data must be available in the page cache
  22. 22. How?
     NAME
            sendfile - transfer data between file descriptors
     SYNOPSIS
            #include <sys/sendfile.h>
            ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);
     — sendfile(2) on Linux doesn't accept flags for controlling its behavior — Interestingly, FreeBSD has such a flag, contributed by nginx and Netflix 1 — 1: https://www.nginx.com/blog/nginx-and-netflix-contribute-new-sendfile2-to-freebsd/
  23. 23. So we have to: 1. pre-read data that is not available in the page cache from disk, 2. and confirm the pages' presence before passing the response to network threads
  24. 24. sendfile(2) to the destination /dev/null — Calling channel.transferTo("/dev/null") (== sendfile(/dev/null)) in the request handler thread might populate the page cache? — Tested it out, and found there's no noticeable performance impact
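A standalone toy version of this trick (hypothetical class, not the broker patch shown later): it warms a given file into the page cache by transferTo()-ing it into /dev/null and prints how long that took. Running it twice in a row shows the effect; the second run should finish almost instantly because the pages are already resident.

     import java.io.File;
     import java.nio.channels.FileChannel;
     import java.nio.file.StandardOpenOption;

     public class PageCacheWarmup {
         public static void main(String[] args) throws Exception {
             File file = new File(args[0]); // e.g. a Kafka log segment
             try (FileChannel src = FileChannel.open(file.toPath(), StandardOpenOption.READ);
                  FileChannel devnull = FileChannel.open(new File("/dev/null").toPath(),
                                                         StandardOpenOption.WRITE)) {
                 long start = System.nanoTime();
                 // Reads the source pages into the page cache; /dev/null discards the data cheaply.
                 // (transferTo() may transfer fewer bytes than requested; a robust version would loop.)
                 src.transferTo(0, src.size(), devnull);
                 System.out.printf("warmed %d bytes in %d ms%n",
                         src.size(), (System.nanoTime() - start) / 1_000_000);
             }
         }
     }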
  25. 25. How could it be that harmless? — The Linux kernel internally uses splice to implement sendfile(2) — splice asks struct file_operations to handle the splice — /dev/null's struct file_operations (null_fops) just iterates the list of page pointers, not each byte — => Iteration count is SIZE / PAGE_SIZE (4k)
     # ./drivers/char/mem.c
     static int pipe_to_null(struct pipe_inode_info *info, struct pipe_buffer *buf,
                             struct splice_desc *sd)
     {
             return sd->len;
     }

     static ssize_t splice_write_null(struct pipe_inode_info *pipe, struct file *out,
                                      loff_t *ppos, size_t len, unsigned int flags)
     {
             return splice_from_pipe(pipe, out, ppos, len, flags, pipe_to_null);
     }
  26. 26. Patching the broker to call sendfile(/dev/null) in request handler threads
     # FileRecords.java
     @SuppressWarnings("UnnecessaryFullyQualifiedName")
     private static final java.nio.file.Path DEVNULL_PATH = new File("/dev/null").toPath();

     public void prepareForRead() throws IOException {
         long size = Math.min(channel.size(), end) - start;
         try (FileChannel devnullChannel = FileChannel.open(DEVNULL_PATH,
                 java.nio.file.StandardOpenOption.WRITE)) {
             channel.transferTo(start, size, devnullChannel);
         }
     }
     — Still not fully portable, because it assumes the underlying kernel's implementation details (so we haven't contributed it upstream...)
  27. 27. ... and more, to minimize the impact of the increased syscalls...
     # Log.scala#read
     @@ -585,6 +586,17 @@ class Log(@volatile var dir: File,
            if(fetchInfo == null) {
              entry = segments.higherEntry(entry.getKey)
            } else {
     +        // For last entries we assume that it is hot enough to still have all data in page cache.
     +        // Most of fetch requests are fetching from the tail of the log, so this optimization
     +        // should save call of readahead() + mmap() + mincore() * N significantly.
     +        if (!isLastEntry && fetchInfo.records.isInstanceOf[FileRecords]) {
     +          try {
     +            info("Prepare Read for " + fetchInfo.records.asInstanceOf[FileRecords].file().getPath)
     +            fetchInfo.records.asInstanceOf[FileRecords].prepareForRead()
     +          } catch {
     +            case e: Throwable => warn("failed to prepare cache for read", e)
     +          }
     +        }
              return fetchInfo
            }
     — Perform cache warmup only if the segment being read IS NOT the latest one — => can save unnecessary syscalls for 99% of Fetch requests
  28. 28. After all
  29. 29. Conclusion — Having fewer clusters enables us to concentrate on reliability engineering and essential troubleshooting/fixes — Preventive engineering enables us to keep operating Kafka clusters with the highest reliability, even under high and inexplicable load — We've had some failures in the development cluster, but never in the production cluster — The important thing in operating on-premise multitenancy: it's not necessary to prevent 100% of failures, but never let the same hole be punched again
  30. 30. End of presentation. Questions?
