Multitenancy: Kafka clusters
for everyone at LINE
Yuto Kawamura - LINE Corporation
Speaker introduction
— Yuto Kawamura
— Senior Software Engineer at
LINE
— Leading project to redesign
microservices architecture
w/ Kafka
— Apache Kafka Contributor
— Speaker at Kafka Summit SF
2017
— Also at Kafka Meetup #3
Outline
— Kafka at LINE as of today (2018.04)
— Challenges on multitenancy
— Engineering for achieving multitenancy
Kafka at LINE as of today
(2018.04)
We have more clusters
— Added more clusters since last
year to support:
— Different DCs
— Security sensitive data w/
SASL+TLS
— They are separated by "purposes" but not by "users"; our multitenancy strategy
— Fewer clusters allow us to concentrate our engineering resources on maximizing their performance
— They're conceptually the "Data Hub" too
One cluster has many users
— Topics:
— 100 ~ 400+ per cluster
— Users:
— few ~ tens per cluster
— Messages: 150 billion messages / day in largest cluster
— 3+ million / sec at peak
— No messages are supposed to be lost, because every usage is somehow tied to a service
Challenges on multitenancy
To do multitenancy, we have to ensure:
— A certain level of isolation among client workloads
— The cluster is proof against abusive clients
— We can track which client sent a particular request
— We have to be confident enough in what we do to say "don't worry" to people saying "we want a dedicated cluster only for us!"
Engineering for achieving
multitenancy
Request Quota
— It's more important to manage the number of requests than the incoming/outgoing byte rate
— Kafka is amazingly strong at handling large data if it is well-batched
— => For consumers, responses are naturally batched
— => The main danger is producers configured with linger.ms=0
— Starting from 0.11.0.0, KIP-124 lets us configure a request rate quota 2
2
https://cwiki.apache.org/confluence/display/KAFKA/KIP-124+-+Request+rate+quotas
Request Quota
— Manage the master copy of cluster config in YAML inside an Ansible repository
— Apply it all at once during cluster provisioning with the kafka_config Ansible module (developed internally)
— Can tell the latest config on a cluster w/o querying it, and can keep change history in git
---
kafka_cluster_configs:
- entity_type: clients
configs:
request_percentage: 40
producer_byte_rate: 1073741824
- entity_type: clients
entity_name: foobar-producer
configs:
request_percentage: 200
Slowlog
— Log requests which take longer than a certain threshold to process
— Kafka has built-in "request logging", but it produces too many lines
— Inspired by HBase's slow query log
# RequestChannel.scala#updateRequestMetrics
+ slowLogThresholdMap.get(metricNames.head).filter(_ >= 0).filter { v =>
+ val targetTime = requestId match {
+ case ApiKeys.FETCH.id => totalTime - apiRemoteTime
+ case _ => totalTime
+ }
+
+ targetTime >= v
+ }.foreach { _ =>
+ requestLogger.warn("Slow response:%s from connection %s;totalTime:%d...
+ .format(requestDesc(true), connectionId, totalTime, requestQueueTime...
+ }
[2016-12-26 16:04:20,135] WARN Slow response:Name: FetchRequest;
Version: 2 ... ;totalTime:1817;localTime: ...
Slowlog
— Thresholds can be changed dynamically through JMX console for each request type
The disk read by delayed consumer problem
— Detection: 50x ~ 100x slowdown in 99th percentile Produce response time
— A certain amount of disk reads
— Network threads' utilization was very high
Suspecting sendfile is taking long...
— Because: 1. disk reads were occurring at that time, 2. network threads' utilization was high
$ stap -e '(script counting sendfile(2) duration histogram)'
value |---------------------------------------- count
0 | 0
1 | 71
2 |@@@ 6171
16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 29472
32 |@@@ 3418
2048 | 0
...
8192 | 3
— Normal: 2 ~ 32us
— Outliers: 8ms ~
— (About SystemTap, see my previous presentation 3
)
3
https://www.slideshare.net/kawamuray/kafka-meetup-jp-3-engineering-apache-kafka-at-line
Kafka broker's thread model
— Network threads (controlled by num.network.threads) take care of reading/writing requests/responses
— Network threads hold established connections exclusively for event-driven IO
— Request handler threads (controlled by num.io.threads) take care of request processing and IO against the block device, except sendfile(2) for Fetch requests
When a Fetch request arrives for data that isn't present in the page cache...
Problem definition
— Network threads contain potentially-blocking ops while they're supposed to work as event loops
— and we have no way to know whether an upcoming sendfile(2) will block awaiting a disk read
It was one of the worst issues we've had, because it:
— Completely breaks resource isolation among all clients, including producers
— Occurs naturally whenever one of the consumers slows down
— Forces us to communicate with users every time to ask for a fix
— Occurs 100% of the time when a broker restores log data from the leader
Solution candidates
— A: Separate network threads among clients
— => Possible, but a lot of changes required
— => Not essential, because network threads should be purely computation intensive
— B: Balance connections among network threads
— => Possible, but again a lot of changes
— => Still, at first the other connections on the same thread get affected
— C: Make sure that data is ready in memory before the response is passed to the network thread
To guarantee non-blocking sendfile(2) in network threads...
— The target data must be available in the page cache
How?
NAME
sendfile - transfer data between file descriptors
SYNOPSIS
#include <sys/sendfile.h>
ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);
— sendfile(2) on Linux doesn't accept flags for controlling its behavior
— Interestingly, FreeBSD has one, contributed by nginx and Netflix 1
1
https://www.nginx.com/blog/nginx-and-netflix-contribute-new-sendfile2-to-freebsd/
So we have to:
1. Pre-read data that isn't in the page cache from disk,
2. and confirm the pages' presence before passing the response to network threads
sendfile(2) to the dest /dev/null
— Calling channel.transferTo("/dev/null") (== sendfile(2) to /dev/null) in a request handler thread might populate the page cache?
— Tested it out, and found there's no noticeable performance impact
How could it be that harmless?
— The Linux kernel internally uses splice to implement sendfile(2)
— splice delegates the work to the target file's struct file_operations
— /dev/null's struct file_operations (null_fops) just iterates over the list of page pointers, not over each byte
— => Iteration count is SIZE / PAGE_SIZE (4k)
# ./drivers/char/mem.c
static int pipe_to_null(struct pipe_inode_info *info, struct pipe_buffer *buf,
struct splice_desc *sd)
{
return sd->len;
}
static ssize_t splice_write_null(struct pipe_inode_info *pipe,struct file *out,
loff_t *ppos, size_t len, unsigned int flags)
{
return splice_from_pipe(pipe, out, ppos, len, flags, pipe_to_null);
}
Patching broker to call sendfile(/dev/null) in request
handler threads
# FileRecords.java
@SuppressWarnings("UnnecessaryFullyQualifiedName")
private static final java.nio.file.Path DEVNULL_PATH = new File("/dev/null").toPath();
public void prepareForRead() throws IOException {
long size = Math.min(channel.size(), end) - start;
try (FileChannel devnullChannel = FileChannel.open(DEVNULL_PATH,
java.nio.file.StandardOpenOption.WRITE)) {
channel.transferTo(start, size, devnullChannel);
}
}
— Still not fully portable because it relies on the underlying kernel's implementation details (so we haven't contributed it upstream...)
... and more to minimize impact of increased syscall...
# Log.scala#read
@@ -585,6 +586,17 @@ class Log(@volatile var dir: File,
if(fetchInfo == null) {
entry = segments.higherEntry(entry.getKey)
} else {
+ // For last entries we assume that it is hot enough to still have all data in page cache.
+ // Most of fetch requests are fetching from the tail of the log, so this optimization
+ // should save call of readahead() + mmap() + mincore() * N significantly.
+ if (!isLastEntry && fetchInfo.records.isInstanceOf[FileRecords]) {
+ try {
+ info("Prepare Read for " + fetchInfo.records.asInstanceOf[FileRecords].file().getPath)
+ fetchInfo.records.asInstanceOf[FileRecords].prepareForRead()
+ } catch {
+ case e: Throwable => warn("failed to prepare cache for read", e)
+ }
+ }
return fetchInfo
}
— Perform cache warmup only if the segment being read IS NOT the latest
— => saves unnecessary syscalls for 99% of Fetch requests
After all
Conclusion
— Having fewer clusters enables us to concentrate on reliability engineering and essential troubleshooting/fixes
— Preventive engineering enables us to keep operating Kafka clusters with the highest reliability, even under high and inexplicable load
— We've had some failures in the development cluster, but never in the production cluster
— What matters in operating on-premise multitenancy: it's not necessary to prevent 100% of failures, but never let the same hole be punched twice
End of presentation.
Questions?
