Kafka at Scale: Multi-Tier Architectures

This is a talk given at ApacheCon 2015. If data is the lifeblood of high technology, Apache Kafka is the circulatory system in use at LinkedIn. It is used for moving every type of data between systems, and it touches virtually every server, every day. This can only be accomplished with multiple Kafka clusters, installed at several sites, and they must all work together to assure no message loss and almost no message duplication. In this presentation, we will discuss the architectural choices behind how the clusters are deployed, and the tools and processes that have been developed to manage them. Todd Palino will also discuss some of the challenges of running Kafka at this scale, and how they are being addressed both operationally and in the Kafka development community. Note: there is a significant amount of slide notes on each slide that go into detail. Please make sure to check out the downloaded file to get the full content!

SITE RELIABILITY ENGINEERING©2015 LinkedIn Corporation. All Rights Reserved.
Kafka at Scale

Todd Palino

You may remember me from such
talks as…
“Apache Kafka Meetup”
And
“Enterprise Kafka: QoS and Multitenancy”

Who Am I?
• Kafka, Samza, and Zookeeper SRE at LinkedIn
• Site Reliability Engineering
  – Administrators
  – Architects
  – Developers
• Keep the site running, always

What Will We Talk About?
• Tiered Cluster Architecture
• Kafka Mirror Maker
• Performance Tuning
• Data Assurance
• What’s Next?

Kafka At LinkedIn
Last year:
• 300+ Kafka brokers
• Over 18,000 topics
• 140,000+ Partitions
• 220 Billion messages per day
• 40 Terabytes In
• 160 Terabytes Out
• Peak Load
  – 3.25 Million messages/sec
  – 5.5 Gigabits/sec Inbound
  – 18 Gigabits/sec Outbound
Now:
• 1100+ Kafka brokers
• Over 32,000 topics
• 350,000+ Partitions
• 875 Billion messages per day
• 185 Terabytes In
• 675 Terabytes Out
• Peak Load
  – 10.5 Million messages/sec
  – 18.5 Gigabits/sec Inbound
  – 70.5 Gigabits/sec Outbound

Tiered Cluster Architecture

One Kafka Cluster

Single Cluster – Remote Clients

Multiple Clusters – Local and Remote Clients

Multiple Clusters – Message Aggregation

Why Not Direct?
• Network Concerns
  – Bandwidth
  – Network partitioning
  – Latency
• Security Concerns
  – Firewalls and ACLs
  – Encrypting data in transit
• Resource Concerns
  – A misbehaving application can swamp production resources

Kafka Mirror Maker

Kafka Mirror Maker
• Consumes from one cluster, produces to another
• No communication from producer back to consumer
• Best practice is to keep the mirror maker local to the target cluster
• Kafka does not prevent loops

Rules of Aggregation
• NEVER produce to aggregate clusters

NEVER produce to aggregate clusters!

Rules of Aggregation
• NEVER produce to aggregate clusters
• Not every topic needs to be aggregated
  – Log compacted topics do not play nice
  – Most queuing topics are local only
• But your whitelist/blacklist configurations must be consistent
  – If you have a topic that is aggregated, make sure to do it from all source clusters to all aggregate clusters
• Carefully consider if you want front-line aggregate clusters
  – It can encourage creating single-master services
  – Sometimes it is necessary, such as for search services
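
The consistency rule can be checked mechanically. The following is a minimal sketch, assuming the mirror maker topic lists have already been expanded into explicit sets per (source, aggregate) pair; the function name and the data layout are hypothetical and not part of Kafka or of the original deck:

```python
def missing_mirrors(mirror_topics, source_clusters, aggregate_clusters):
    """Find aggregated topics that are not mirrored on every source/aggregate pair.

    mirror_topics maps (source, aggregate) -> set of topic names mirrored on
    that path (hypothetical layout). Returns sorted (topic, source, aggregate)
    triples for every path that is missing an aggregated topic.
    """
    # Any topic mirrored on at least one path counts as "aggregated".
    aggregated = set().union(*mirror_topics.values())
    missing = []
    for topic in sorted(aggregated):
        for source in source_clusters:
            for aggregate in aggregate_clusters:
                if topic not in mirror_topics.get((source, aggregate), set()):
                    missing.append((topic, source, aggregate))
    return missing


example = {
    ("A", "aggA"): {"page_views", "clicks"},
    ("B", "aggA"): {"page_views"},
}
print(missing_mirrors(example, ["A", "B"], ["aggA"]))
```

In this example "clicks" is mirrored from cluster A only, so the check reports the missing B-to-aggA path.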

Mirror Maker Concerns
• Adding a site increases the number of mirror maker instances
  – Solution: Multi-consumer mirror makers
• Mirror maker can lose messages like any producer
  – Solution: reduce in-flight batches and acks=-1
• Mirror maker has to decompress and recompress every batch
  – Possible solution: flag compressed batches for keyed messages
• Message partitions are not preserved
  – Possible solution: an identity mirror maker
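
The message-loss remedy corresponds to a few Kafka producer settings. A minimal sketch, assuming the Java producer's property names `acks`, `max.in.flight.requests.per.connection`, and `retries`; the helper name and the retry value are illustrative assumptions, not LinkedIn's actual configuration:

```python
# Hypothetical settings for a mirror maker producer that favours
# durability over throughput. The values are assumptions for illustration.
SAFE_PRODUCER_SETTINGS = {
    "acks": "-1",  # wait for all in-sync replicas (the slide's acks=-1)
    "max.in.flight.requests.per.connection": "1",  # one unacknowledged batch at a time
    "retries": "2147483647",  # assumption: keep retrying instead of dropping
}


def to_properties(settings):
    """Render a dict as Java-style .properties text, one key=value per line."""
    return "\n".join(f"{key}={value}" for key, value in sorted(settings.items()))


print(to_properties(SAFE_PRODUCER_SETTINGS))
```

Limiting in-flight requests trades throughput for a smaller window of unacknowledged data, which is why it is paired with more producer streams elsewhere in the deck.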

Performance Tuning

Kafka Cluster Sizing
• How big for your local cluster?
  – How much disk space do you have?
  – How much network bandwidth do you have?
  – CPU, memory, disk I/O
• How big for your aggregate cluster?
  – In general, multiply the number of brokers in a local cluster by the number of local clusters
  – May have additional concerns with lots of consumers
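
As a rough illustration of these questions, the sketch below sizes a local cluster by whichever of disk or network is the tighter constraint, then applies the multiplication rule for the aggregate cluster. The formula, the function names, and the example figures are assumptions for illustration; it ignores replication traffic, CPU, memory, and disk I/O, which the slide also lists:

```python
import math


def brokers_needed(retained_bytes, replication_factor, disk_bytes_per_broker,
                   peak_bytes_per_sec, net_bytes_per_sec_per_broker):
    """Return the larger of the disk-bound and network-bound broker counts."""
    by_disk = math.ceil(retained_bytes * replication_factor / disk_bytes_per_broker)
    by_network = math.ceil(peak_bytes_per_sec / net_bytes_per_sec_per_broker)
    return max(by_disk, by_network)


def aggregate_brokers(local_brokers, local_clusters):
    """Starting point from the slide: local broker count times local clusters."""
    return local_brokers * local_clusters


# Hypothetical numbers: 40 TB retained, replication factor 2, 10 TB of disk
# per broker, 700 MB/s peak inbound, 100 MB/s usable network per broker.
local = brokers_needed(40 * 10**12, 2, 10 * 10**12, 700 * 10**6, 100 * 10**6)
print(local, aggregate_brokers(local, 3))
```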

Topic Configuration
• Partition Counts for Local
  – Many theories on how to do this correctly, but the answer is “it depends”
  – How many consumers do you have?
  – Do you have specific partition requirements?
  – Keeping partition sizes manageable
• Partition Counts for Aggregate
  – Multiply the number of partitions in a local cluster by the number of local clusters
  – Periodically review partition counts in all clusters
• Message Retention
  – If aggregate is where you really need the messages, retain them in local only long enough to cover mirror maker problems
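
The periodic review of partition counts can be sketched as a comparison against the multiplication rule. This is a minimal example that assumes partition counts have already been collected into plain dictionaries; the function name and topic names are hypothetical:

```python
def undersized_aggregate_topics(local_counts, aggregate_counts, local_clusters):
    """Return {topic: (actual, expected)} for aggregate topics with too few partitions.

    expected = local partition count * number of local clusters.
    Topics absent from the aggregate cluster are skipped, since not every
    topic is aggregated.
    """
    flagged = {}
    for topic, local in local_counts.items():
        expected = local * local_clusters
        actual = aggregate_counts.get(topic)
        if actual is not None and actual < expected:
            flagged[topic] = (actual, expected)
    return flagged


local_counts = {"page_views": 8, "internal_queue": 4}
aggregate_counts = {"page_views": 16}
print(undersized_aggregate_topics(local_counts, aggregate_counts, 3))
```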

Mirror Maker Sizing
• Number of servers and streams
  – Size the number of servers based on the peak bytes per second
  – Co-locate mirror makers
  – Run more mirror makers in an instance than you need
  – Use multiple consumer and producer streams
• Other tunables to look at
  – Partition assignment strategy
  – In flight requests per connection
  – Linger time
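
Sizing by peak bytes per second, with spare capacity for catching up, might look like the sketch below. The headroom fraction and the per-server throughput figure are assumptions for illustration, not numbers from the talk:

```python
import math


def mirror_maker_servers(peak_bytes_per_sec, bytes_per_sec_per_server, headroom=0.5):
    """Servers needed so peak load uses at most (1 - headroom) of total capacity.

    headroom is the fraction of each server kept free for catching up after
    an outage (an assumed policy, not a Kafka setting).
    """
    usable = bytes_per_sec_per_server * (1 - headroom)
    return math.ceil(peak_bytes_per_sec / usable)


# Hypothetical: 300 MB/s peak, each server sustains 100 MB/s.
print(mirror_maker_servers(300 * 10**6, 100 * 10**6))
```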

Segregation of Topics
• Not all topics are created equal
• High Priority Topics
  – Topics that change search results
  – Topics used for hourly or daily reporting
• Run a separate mirror maker for these topics
  – One bloated topic won’t affect reporting
  – Restarting the mirror maker takes less time
  – Less time to catch up when you fall behind

Data Assurance

Monitoring
• Kafka is great for monitoring your applications

Monitoring
• Have a system for monitoring Kafka components that does not use Kafka
  – At least for critical metrics
• For tiered architectures
  – Simple health check on mirror maker instances
  – Mirror maker consumer lag
• Is the data intact?
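
Mirror maker consumer lag is the gap between the newest offset in the source cluster and the offset the mirror maker has committed. A minimal sketch, assuming both sets of offsets have already been fetched into dictionaries keyed by partition; the function names and the threshold are hypothetical, and treating a partition with no commit as starting from offset 0 is a simplifying assumption:

```python
def partition_lag(log_end_offsets, committed_offsets):
    """Lag per partition: log end offset minus last committed offset."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


def lagging_partitions(lag_by_partition, threshold):
    """Partitions whose lag exceeds the alert threshold, in sorted order."""
    return sorted(p for p, lag in lag_by_partition.items() if lag > threshold)


lag = partition_lag({0: 100, 1: 250}, {0: 100, 1: 200})
print(lag, lagging_partitions(lag, 10))
```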

Auditing Message Flows

Audit Content
• Message audit header
  – Timestamp
  – Service and hostname
• Audit messages
  – Start and end timestamps
  – Topic and tier
  – Count
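
The two record types can be sketched as plain data classes, with audit messages produced by counting headers per time bucket. The field names, the bucketing scheme, and the `summarize` helper are assumptions for illustration; the deck lists the fields but not their layout:

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditHeader:
    """Carried on every message (hypothetical field names)."""
    timestamp_ms: int
    service: str
    hostname: str


@dataclass(frozen=True)
class AuditMessage:
    """Periodic count of messages seen for one topic in one tier."""
    start_ms: int
    end_ms: int
    topic: str
    tier: str
    count: int


def summarize(headers, topic, tier, bucket_ms):
    """Bucket headers by timestamp and emit one AuditMessage per bucket."""
    buckets = Counter(h.timestamp_ms // bucket_ms for h in headers)
    return [
        AuditMessage(b * bucket_ms, (b + 1) * bucket_ms, topic, tier, n)
        for b, n in sorted(buckets.items())
    ]


headers = [
    AuditHeader(1000, "frontend", "host1"),
    AuditHeader(1500, "frontend", "host2"),
    AuditHeader(61000, "frontend", "host1"),
]
for message in summarize(headers, "page_views", "local", 60000):
    print(message)
```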

Audit Concerns
• We are only counting messages
  – Duplication of messages can hide losses
  – Using the detailed service and host audit criteria, we can get around this
• We can’t audit all consumers
  – The relational DB has issues keeping up with bootstrapping clients
  – This can be improved with changes to the database backend
• We cannot handle complex message flows
  – The total number of messages has to appear in each tier that the topic is in
  – Multiple source clusters must have the same tier name
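
A small worked example shows how duplication can hide a loss when only totals are compared, and how breaking counts down by service and host exposes it. The counts and names are invented for illustration:

```python
from collections import Counter


def count_diff(source, destination):
    """Per-key difference (destination - source); keys with no difference are omitted."""
    keys = set(source) | set(destination)
    diff = {k: destination.get(k, 0) - source.get(k, 0) for k in keys}
    return {k: d for k, d in diff.items() if d != 0}


# Hypothetical counts keyed by (service, hostname) for one topic and time window.
local = Counter({("frontend", "host1"): 60, ("frontend", "host2"): 40})
aggregate = Counter({("frontend", "host1"): 65, ("frontend", "host2"): 35})

# Totals match, so a count-only audit sees nothing wrong.
total_diff = sum(aggregate.values()) - sum(local.values())

# The per-host breakdown shows 5 duplicates from host1 masking 5 losses from host2.
per_host = count_diff(local, aggregate)
print(total_diff, per_host)
```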

Conclusion

Work Needed in Kafka
• Access controls
• Encryption
• Quotas
• Decompression improvements in mirror maker

Getting Involved With Kafka
• http://kafka.apache.org
• Join the mailing lists
  – users@kafka.apache.org
  – dev@kafka.apache.org
• irc.freenode.net - #apache-kafka
• Meetups
  – Apache Kafka - http://www.meetup.com/http-kafka-apache-org
  – Bay Area Samza - http://www.meetup.com/Bay-Area-Samza-Meetup/
• Contribute code

Editor's Notes

So who am I, and why am I qualified to stand up here? I am one-fourth of the Data Infrastructure Streaming SRE team at LinkedIn. We’re responsible for Kafka, Samza, and Zookeeper operations. SRE stands for Site Reliability Engineering. Many of you, like myself before I started in this role, may not be familiar with the title. SRE combines several roles that fit together into one Operations position. Foremost, we are administrators. We manage all of the systems in our area. We are also architects. We do capacity planning for our deployments, plan out our infrastructure in new datacenters, and make sure all the pieces fit together. And we are also developers. We identify tools we need, both to make our jobs easier and to keep our users happy, and we write and maintain them. At the end of the day, our job is to keep the site running, always.
What are the things we are going to cover in this talk? We’ll start by talking about the basics of how Kafka works, very briefly, and move right into what a tiered architecture looks like along with the infrastructure tool we use for creating our tiers – mirror maker. I will cover performance tuning, specifically when it comes to laying out and managing tiered clusters. I’ll also talk about monitoring and other ways we assure our data gets where it is going, intact. Lastly, we’ll talk about what work is going on right now that will continue to improve the ecosystem for running large Kafka installations, and what you can do to get involved.
These are the numbers I presented this time last year, as far as how much data we push around in Kafka at LinkedIn. Over the last year, it’s changed significantly. We now have well over 1100 brokers in total in our 50+ clusters, which are managing over 31,000 topics, with over 350 thousand partitions between them, not including replication. We’ve gone from 220 billion messages a day to over 875 billion, and that was a slow day. There is now over 185 terabytes per day flowing into Kafka, an increase of almost 4 times, and consumers are reading over 675 terabytes per day out. Of course, those are both compressed data numbers. At peak, we’re receiving over 10 and a half million messages per second, for a total of 18 and a half gigabits per second of inbound traffic, and the consumers are reading over 70 gigabits per second at the same time. Again, this is compressed data. This is a fairly astonishing growth rate for the amount of data we are moving around with Kafka. Some of it comes from standing up new datacenters, so let’s move directly into what that looks like.
Let’s move right into what Kafka clusters look like, and what happens when we start organizing them into tiers.
I won’t be going into too much detail on how Kafka works. If you do not have a basic understanding of Kafka itself, I suggest checking out some of the resources listed in the Reference slides at the end of this deck. Here’s what a single Kafka cluster looks like at LinkedIn. I’ll get into some details on the TrackerProducer/TrackerConsumer components later, but they are internal libraries that wrap the open source Kafka producer and consumer components and integrate with our schema registry and our monitoring systems. Every cluster has multiple Kafka brokers, storing their metadata in a Zookeeper ensemble. We have producers sending messages in, and consumers reading messages out. At the present time, our consumers talk to Zookeeper as well and everything works well. In LinkedIn’s environment, all of these components live in the same datacenter, in the same network. What happens when you have two sites to deal with?
Multiple datacenters is where this starts to get interesting. Here is an example of a layout that uses one Kafka cluster. We’re keeping the cluster in a single datacenter, because having it span datacenters is an entirely different level of complexity. In addition, Kafka has no provision for reading from the follower brokers, so you would still be crossing datacenters with your clients. The problem with this layout should be quite obvious – if we lose Datacenter A, we’ve lost everything. Not only do we have concerns with network partitions between the datacenters cutting off access for one consumer or producer or another, we have no redundancy at all.
So to improve this situation, we’ll run a Kafka cluster in each of our primary datacenters, A and B. In this layout, C is a lower-tier datacenter where we don’t have producers of the data, only consumers. Consider it a backend environment where you run things like Hadoop. Our producers all talk to the local Kafka cluster. Consumers in the primary datacenters talk to their local cluster as well. Now if we lose either datacenter A or B, the other datacenter can continue to operate. We’ve pushed more complexity on the consumers that need to access all of the data from both datacenters, however. They have to maintain consumer connections to both clusters, and they will have to deal with networking problems that come up as a result. In addition, latency in the network connection can manifest in strange ways in an application.
Now we iterate on the architecture one more time. We add the concept of an aggregate Kafka cluster, which contains all of the messages from each of the primary datacenter local clusters. We also have a copy of this cluster in the secondary datacenter, C, for consumers there to access. We still have cross-datacenter traffic – that can’t be avoided if we need to move data around. But we have isolated it to one application, mirror maker, which we can monitor and assure works properly. This is a better situation than needing to have each consumer worry about it for themselves. We’ve definitely added complexity here, but it serves a purpose. By having the infrastructure be a little more complex, we simplify the usage of Kafka for our customers. Producers know that if they send messages to their local cluster, they will show up in the appropriate places without additional work on their part. Consumers can select which view of the data they need, and have assurances that they will see everything that is produced. The intricacies of how the data gets moved around are left to people like me, who run the Kafka infrastructure itself.
We’ve chosen to keep all of our clients local to the clusters and use a tiered architecture due to several major concerns. The primary concern is around the networking itself. Kafka enables multiple consumers to read the same topic, which means if we are reading remotely, we are copying messages over expensive inter-datacenter connections multiple times. We also have to handle problems like network partitioning in every client. Granted, you can have a partition even within a single datacenter, but it happens much more frequently when you are dealing with large distances. There’s also the concern of latency in connections – distance increases latency. Latency can cause interesting problems in client applications, and I like life to be boring. There are also security concerns around talking across datacenters. If we keep all of our clients local, we do not have to worry about ACL problems between the clients and the brokers (and Zookeeper as well). We can also deal with the problem of encrypting data in transit much more easily. This is one problem we have not worried about as much, but it is becoming a big concern now. The last concern is over resource usage. Everything at LinkedIn talks to Kafka, and a problem that takes out a production cluster is a major event. It could mean we have to shift traffic out of the datacenter until we resolve it, or it could result in inconsistent behavior in applications. Any application could overwhelm a cluster, but there are some, such as applications that run in Hadoop, that are more prone to this. By keeping those clients talking to a cluster that is separate from the front end, we mitigate resource contention.
So how do we move all of these messages around? That falls to the Kafka mirror maker application, which is part of the open source project.
Mirror Maker’s sole job is to copy messages from one cluster to another, and it is the glue that ties a multi-tier architecture together. It consumes messages from one cluster, and puts them on an internal queue. It then pops messages off the queue and produces them to the target cluster. Because of this architecture, there is no communication back from the producer to the consumer component. The only communication is if the queue fills up, which will cause all of the consumer streams to block. The only way this happens, however, is if the producer stops reading from the queue. If it is retrying due to a target cluster problem, it will eventually drop messages. This is why it’s a best practice to place the mirror maker in the same network as the target cluster. If a mirror maker fails to consume messages, it will just not commit offsets and you will not lose messages. You will just slow down or stop consuming. If the mirror maker fails to produce messages, then more likely than not it will start dropping messages. This means we want to keep network problems on the consumer side as much as possible. Another thing to consider is that Kafka has no way to prevent loops. If you have two mirror makers, one that copies messages from cluster A to cluster B, and another that copies in the reverse direction, they will duplicate messages if configured with the same topics. It’s a great way to fill up your disk very quickly.
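The consume, queue, produce flow described above can be sketched in a few lines. This is only an illustration under stated assumptions, not mirror maker's actual code: the function name `mirror`, the list standing in for the target cluster, and the queue size are all hypothetical.

```python
import queue
import threading


def mirror(source_messages, target_log, queue_size=4):
    """Copy messages to target_log through a bounded internal queue."""
    buffer = queue.Queue(maxsize=queue_size)  # hypothetical internal queue
    done = object()  # sentinel that marks the end of the input

    def consume():
        for message in source_messages:
            # put() blocks while the queue is full, which is the only
            # "signal" the consumer side ever gets from the producer side.
            buffer.put(message)
        buffer.put(done)

    def produce():
        while True:
            message = buffer.get()
            if message is done:
                break
            # Stand-in for a send to the target cluster.
            target_log.append(message)

    workers = [threading.Thread(target=consume), threading.Thread(target=produce)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return target_log


copied = mirror([f"msg-{i}" for i in range(10)], [])
```

If the producing side stalls, the queue fills and the consuming side simply blocks, which is the behavior the notes describe.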
There are some rules that should be followed when setting up tiers of Kafka clusters. The first is that you always produce to local clusters, never to the aggregate clusters themselves.
This is important. NEVER do this! If you produce to an aggregate tier, your aggregate tiers will be out of sync with each other. Part of the idea behind an aggregate tier is that it contains everything. If you change this assumption, you are doing something different. Don’t do it. Not even once.
Past that first rule, keep in mind that not everything needs to be in the aggregate cluster. The first part of this is that not everything can be mirrored. Log compacted topics, in particular, do not play nice with mirror maker. Queuing topics which are internal to applications most often do not need to be aggregated as well. But when you do aggregate, make sure you are consistent. If a topic shows up in your aggregate tier, it needs to be mirrored from all of the source clusters, not just a subset. If you do not do this, you break a promise to your customers that aggregate contains the entirety of what was produced to local. Lastly, consider very carefully if you actually want to have aggregate clusters in your front-end datacenters. We have them, but I often wish we didn’t. If you have the aggregate view of data in your production datacenter, you can inadvertently encourage the creation of single-master services. That is, an application which runs in only one datacenter. This happens with Kafka because there is no way to equate offsets between two different clusters, even if those clusters have the same topics and message content. Applications like this are troublesome when you want to shut down a datacenter, due to problems or maintenance. That said, there are use cases where it is hard to avoid using the aggregate view. Search services, which need to modify their indices based on what happens everywhere, are an example. But if you can move this type of processing to a second-tier network, and then copy the results back out to a front-end application, it can be worth the additional complexity.
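One way to check the consistency rule is to compare the topics in the aggregate tier against what each source cluster is mirroring. The sketch below assumes you can obtain those topic sets somehow; the function name, cluster names, and topic names are all made up for illustration.

```python
def missing_mirrors(aggregate_topics, mirrored_from):
    """Map each aggregate topic to the source clusters that do not mirror it."""
    gaps = {}
    for topic in sorted(aggregate_topics):
        absent = sorted(
            cluster
            for cluster, topics in mirrored_from.items()
            if topic not in topics
        )
        if absent:
            gaps[topic] = absent
    return gaps


# Hypothetical layout: three local clusters feeding one aggregate cluster.
mirrored_from = {
    "local-a": {"page-views", "clicks"},
    "local-b": {"page-views", "clicks"},
    "local-c": {"page-views"},
}
gaps = missing_mirrors({"page-views", "clicks"}, mirrored_from)
print(gaps)  # {'clicks': ['local-c']}
```

An empty result means every aggregate topic is mirrored from every source cluster.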
One of the first issues with mirror maker is that as you add new local clusters, you geometrically increase the number of paths you have to mirror messages over. If each path is a mirror maker instance, this gets out of control very quickly. Thankfully, this has a simple solution: you can configure a mirror maker with multiple consumers. We actually moved away from this design early on, but are now looking at moving back since we have more sites to manage and the mirror maker application has matured quite a bit. Another big problem is that the mirror maker producer can lose messages. If it is sending messages to a cluster and there is a leadership change, it will lose batches that are in flight. This presents a big problem for clients who want to assure the delivery of their messages, as they can set acks=-1 in their application to be safer but all bets are off when there is a mirror maker involved to move the messages to an aggregate cluster. The solution here is to keep fewer batches in flight at any given time, and to use acks=-1 on the mirror maker producer as well. We are currently testing this configuration change so that we can provide a better end-to-end service. Mirror maker is also forced to decompress every message batch upon consuming it, and then recompress it on sending it to the target cluster. The reason for this is that the mirror maker has no idea, looking at the compressed batch, if that batch contains keyed messages. If it does, the mirror maker needs to honor that and send the messages to specific partitions. As we use few keyed topics, this is a huge waste of resources. A possible solution is to flag the compressed batch as to whether or not it contains keyed messages, and only decompress it if it does. This doesn’t have a solution right now, but there’s ongoing work to address it. In addition to this, mirror maker cannot preserve the partition of a message.
Within a single Kafka cluster, you can be assured that the order of messages in a partition is the order in which they were produced. You also know that for a keyed topic, a message with a specific key will always show up in the same partition. Mirror maker essentially shuffles all of the partitions again: unkeyed messages will end up mingled with messages from other partitions, and keyed messages will still all go to the same partition, but it will not necessarily be the partition they were in in the source cluster. One way to resolve this would be an “identity” mirror maker, which preserves partitioning. This isn’t a simple solution, however, because you need to take into account the case where you are coming from two local clusters with 8 partitions each into an aggregate cluster with 16 partitions. Does it fail? Interleave? Offset? The desired behavior will depend largely on the user.
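To see why a keyed message can land in a different partition after mirroring, consider a hash-modulo partitioner. The sketch below uses `zlib.crc32` purely as a deterministic stand-in; it is an assumption for illustration, not Kafka's partitioner, and the keys are made up.

```python
import zlib


def partition_for(key, num_partitions):
    """Pick a partition from the key's hash (illustrative, not Kafka's partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = ["member-1", "member-2", "member-3", "member-4"]

# The same key always maps to one partition within a cluster...
local = {key: partition_for(key, 8) for key in keys}

# ...but the partition number depends on the partition count, so an
# aggregate topic with 16 partitions may place the key somewhere else.
aggregate = {key: partition_for(key, 16) for key in keys}

for key in keys:
    print(key, "local:", local[key], "aggregate:", aggregate[key])
```

All messages for a key still end up together in the aggregate topic; what is lost is the correspondence with the source partition.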
More components means that we have more places to poke and prod to get the most efficiency out of our system. With multiple tiers most of this revolves around making sure the sizes of everything are correct.
It may not strictly be tuning, but having your Kafka clusters be the right size is the first place you need to start. It can be a little difficult to determine exactly how large your cluster should be, but there are a few major points you have to consider. First, how much disk space do you have on each broker, and how much space do you need to maintain message retention? One of our rules is that we keep disk usage for the log segments partition to under 60%. This allows us enough headroom to move partitions around when needed (especially because retention for the partition resets when you move it). The next concern is how much network you have. If you have gigabit network interfaces, and your Kafka cluster is going to receive 5 gigabits per second of traffic at peak, you need to take that into account. CPU, memory, and disk I/O all take a back seat to these concerns, because they mostly drive how fast your cluster operates, not if it operates at all. Size your local clusters first, then you can consider how large your aggregate cluster needs to be. For the most part, taking the number of local clusters you have, multiplied by the number of brokers in each local cluster, will give you the size of your aggregate cluster. If you have 3 local clusters with 10 brokers each, your aggregate cluster should be at least 30 brokers. You’ll also need to take into account the number of consumers of the aggregate messages. This can affect how much network bandwidth you need, which will change the number of brokers that you need.
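The sizing rules above reduce to some simple arithmetic. The following is a rough sketch with hypothetical function names; it gives lower bounds only, and it assumes the traffic and retention figures you pass in already account for replication and consumer traffic.

```python
import math


def brokers_for_network(peak_gbps, nic_gbps):
    """Minimum brokers so that peak inbound traffic fits on the interfaces."""
    return math.ceil(peak_gbps / nic_gbps)


def brokers_for_disk(retained_tb, disk_tb_per_broker, max_usage_pct=60):
    """Minimum brokers so that retained data stays under the usage ceiling."""
    return math.ceil(retained_tb * 100 / (disk_tb_per_broker * max_usage_pct))


def aggregate_brokers(local_clusters, brokers_per_cluster):
    """Starting point for the aggregate cluster size."""
    return local_clusters * brokers_per_cluster


print(brokers_for_network(5, 1))   # 5 gigabits per second over gigabit NICs
print(brokers_for_disk(30, 5))     # 30 TB retained, 5 TB of disk per broker
print(aggregate_brokers(3, 10))    # 3 local clusters of 10 brokers each
```

Take the larger of the network and disk answers for each local cluster, then adjust the aggregate figure upward for the number of consumers reading from it.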
You will also need to size your topics appropriately. This tends to be a topic of much discussion, because there are so many variables that get considered here. For example: How many brokers do you have? Do you want to perfectly balance the topic across the brokers? If so, the number of partitions should be a multiple of the number of brokers. How many consumers does your topic have in its largest consumer group? If you have 8 partitions and 16 consumers, 8 of those consumers will be sitting idle. Does your application have specific requirements around partition counts? If you are using keyed messages, you may want to go with a larger number of partitions to start with so you don’t have to expand it later based on other criteria. Another concern we have is around keeping the size of a partition on disk manageable. Very large partitions can be harder to keep balanced in a cluster to make sure each broker is doing its fair share of work. We use a guideline internally of making sure partitions do not exceed 50 gigabytes on disk. When they get close to or exceed that, we expand the topic (provided it is not a keyed topic). Once again here, the partition counts in your aggregate cluster should be a simple calculation. For most topics, we take the number of partitions in the local cluster, multiply it by the number of local clusters, and that is the number of partitions in the aggregate cluster. You also want to check your partition counts regularly, especially if you use automatic topic creation. An imbalance in the number of partitions between clusters can bog down mirror maker very quickly. Another thing to consider is how much retention you need. There’s nothing that says that retention has to be the same in the local and aggregate tiers. You may want to retain messages longer in aggregate, keeping them in the local tier only long enough to get them out. Just remember that this can change your sizing calculation for the aggregate clusters.
If you have twice the retention, you may very well need twice the number of brokers.
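The topic sizing guidelines can be written down as a few small checks. This is a sketch with hypothetical helper names; the 50 gigabyte limit is the guideline quoted above, and treating "at or over the limit" as the trigger is an assumption.

```python
def idle_consumers(partitions, consumers):
    """Consumers in a group beyond the partition count have nothing to read."""
    return max(0, consumers - partitions)


def aggregate_partitions(local_partitions, local_clusters):
    """Partition count for the aggregate topic under the simple rule."""
    return local_partitions * local_clusters


def needs_expansion(partition_size_gb, keyed, limit_gb=50):
    """Flag a topic for expansion unless it is keyed (expanding would remap keys)."""
    return partition_size_gb >= limit_gb and not keyed


print(idle_consumers(8, 16))         # 8
print(aggregate_partitions(8, 2))    # 16
print(needs_expansion(55, False))    # True
print(needs_expansion(55, True))     # False
```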
We introduced another component with Mirror Maker, so we also need to size that appropriately. When we talk about sizing, we are talking about the number of copies of mirror maker with the same consumer and producer configuration, which is one pipeline. With mirror maker, it’s mostly about network throughput. Because of the decompression and recompression of message batches, you’re probably never going to run at wire speed. This means that you can easily co-locate multiple mirror makers on one set of servers to efficiently use them. You should also make sure that you are running more copies of mirror maker than you need to handle your peak traffic. If you fall behind, such as if you have a network partition for a period of time, you want to be able to catch up quickly. If you don’t have excess capacity, it will take a long time, or you will just continue to fall behind. You should also run multiple consumer and producer streams in each copy of mirror maker, as this will allow you to take advantage of the parallel nature of having multiple partitions. If you can process 15 megabytes per second at peak on one stream, you won’t get 30 with two streams, but you’ll do a lot better. We run with 8 consumer streams and 4 producer streams, and it works out pretty well. We also co-locate up to 11 mirror makers on one host, each for a separate pipeline. There are a few other parameters to consider. One is the partition assignment strategy. We asked our developers to add a round robin strategy of balancing partitions for wildcard consumers, like mirror maker. This provides a nice balance of partitions across your mirror makers, and should almost certainly be the configuration you use. You should also set the number of in flight requests per connection. A higher number will make things go faster, but it will also mean more loss of messages if mirror maker breaks. The linger time for the producer is another thing to look at. 
A longer linger time will allow mirror maker to assemble more efficient batches of messages, but it will also mean that messages take a little longer to get through the pipeline. Weigh which tradeoffs are the right ones for you.
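The point about excess capacity is easy to quantify. As a back-of-the-envelope sketch (the function name is hypothetical, and it assumes the inbound rate stays constant while you catch up), the backlog built up during an outage can only drain at the rate of your spare capacity:

```python
def catch_up_seconds(outage_seconds, inbound_mb_s, capacity_mb_s):
    """Time needed to drain the backlog that built up during an outage."""
    backlog_mb = outage_seconds * inbound_mb_s
    spare_mb_s = capacity_mb_s - inbound_mb_s
    if spare_mb_s <= 0:
        # With no headroom the backlog never shrinks.
        return float("inf")
    return backlog_mb / spare_mb_s


# A 10 minute outage at 100 MB/s inbound, with 150 MB/s of mirror capacity.
print(catch_up_seconds(600, 100, 150))  # 1200.0
# The same outage with no headroom at all.
print(catch_up_seconds(600, 100, 100))  # inf
```

With 50% headroom, a 10 minute outage takes 20 minutes to recover from; with none, you never recover.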
Another thing you can do is to provide separate paths for different topics between the same two clusters. We do this at LinkedIn because not all topics are created equal. We have high priority topics. For you, this could be topics that change search results. For us, it’s mostly topics that are used for hourly or daily reporting. Most other topics, especially headed to Hadoop, we’re OK if they’re a little delayed. But if the hourly report to the executives is delayed, you can be sure that I’m fielding phone calls as to exactly what is broken and when it will be fixed. For these topics, we have two separate mirror maker pipelines that run in parallel. The high priority mirror maker has a small whitelist of topics, and the other mirror maker has a blacklist that contains the same topic list. This way a bloated topic that is not considered a priority will not delay the most important topics. It also means that the priority mirror maker starts up faster, and takes less time to catch up when there is a problem. These are all very good things if it means the CEO doesn’t know my name.
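The key property of this split is that the whitelist and the blacklist are built from the same topic list, so every topic flows through exactly one pipeline. A minimal sketch of that idea follows; the topic names and the pipeline labels are made up, and Python's `re` module stands in for whatever pattern matching your mirror maker configuration uses.

```python
import re

# Hypothetical priority topics; both pipelines derive their pattern from this list.
PRIORITY_TOPICS = ["hourly-report", "daily-report"]
PRIORITY_PATTERN = re.compile("|".join(re.escape(t) for t in PRIORITY_TOPICS))


def pipeline_for(topic):
    """Route a topic to the priority pipeline (whitelist) or the bulk one (blacklist)."""
    if PRIORITY_PATTERN.fullmatch(topic):
        return "priority"
    return "bulk"


for topic in ["hourly-report", "page-views", "hourly-report-v2"]:
    print(topic, "->", pipeline_for(topic))
```

Generating both patterns from one list avoids the failure mode where a topic is added to one pipeline but not excluded from the other.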
As the people running the Kafka infrastructure, we take on more responsibility by moving the complexity to our environment. This means that we need to be vigilant to make sure that the promises we make to our customers are kept. So we monitor the infrastructure to make sure everything is running properly, but we also need to make sure that when it is running, it is doing the right thing. Namely, moving all the messages.
Many of us use Kafka for monitoring applications. At LinkedIn, every application and server writes metrics and logs to Kafka. We have central applications that read out these metrics and provide pretty graphs, thresholding, and other tools to make sure that everything is running properly within LinkedIn. Kafka itself is no exception, which leads to this… As soon as I say “monitoring Kafka with Kafka”, we know this is not a good thing.
This means that we need to have a way of monitoring Kafka that does not rely on Kafka itself, at least for the critical metrics that tell us whether or not Kafka is working. For metrics that we look at over a longer term, such as growth metrics, it’s OK to funnel those through Kafka into the same system. But if your Kafka cluster for metrics dies, you will hear nothing but silence from your alerting system. We’ve written a monitoring system in our environment that watches the key metrics in Kafka and provides a completely separate path for thresholding and notifications. When it comes to things specific to tiered architectures, what you need to monitor is the health of the mirror maker application. You want a healthcheck to know that mirror maker is running, and you also want to monitor the consumer lag to make sure it is not falling behind. The bigger question, which basic monitoring cannot answer, is whether or not the data in your tiers is intact. Does your aggregate tier contain all of the messages it is supposed to? For this, we need a more detailed audit of the messages that are produced.
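Consumer lag for a mirror maker is just the gap between the newest offset in each source partition and the offset the mirror maker has committed. A small sketch of a lag check is below; the function name, the offsets, and the threshold are hypothetical, and how you fetch the offsets is left out.

```python
def lagging_partitions(log_end_offsets, committed_offsets, max_lag):
    """Return {partition: lag} for partitions whose lag exceeds max_lag."""
    lagging = {}
    for partition, end_offset in log_end_offsets.items():
        # A partition with no committed offset is treated as fully behind.
        lag = end_offset - committed_offsets.get(partition, 0)
        if lag > max_lag:
            lagging[partition] = lag
    return lagging


log_end = {0: 10_000, 1: 12_000, 2: 9_000}
committed = {0: 9_990, 1: 7_000}
print(lagging_partitions(log_end, committed, max_lag=1_000))  # {1: 5000, 2: 9000}
```

The important part is that a check like this runs on a path that does not itself depend on Kafka.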
In LinkedIn’s environment, the producers of messages use an internal library called TrackerProducer. This library takes care of proper Avro encoding of messages, interfacing with a separate schema-registry for schema lookups. This library also starts the trail of audit information for messages. Every 10 minutes it produces a message into a special audit topic on the Kafka cluster with a count of how many messages were produced in the last 10 minutes. Additionally, the Kafka cluster has an audit consumer which reads all messages out of the cluster and publishes back audit topic messages with counts of how many messages were produced into each topic for each 10 minute period. Combining this with the producer audit, we can be assured that all messages that producers attempted to send actually made it into Kafka. If the counts do not match, we know that there is a problem with one or more producers. Moving down to the aggregate cluster, there is another audit consumer instance which allows us to now compare the number of messages in the aggregate cluster to the number in the local cluster, and the number that were produced. We also introduce the concept of an auditing consumer here, which writes audit messages about how many messages it consumed for each 10 minute period. This completes an end-to-end accounting, from producer to consumer, of every message. A special consumer reads all audit messages out of the Kafka clusters and writes them to a database for performing comparisons.
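The comparison step amounts to checking each tier's count against what the producers reported for the same topic and time bucket. Here is a minimal sketch of that comparison; the function name and the aggregate tier name are hypothetical, and it is not meant to represent the internal audit code.

```python
def audit_gaps(counts_by_tier, reference_tier="producer"):
    """Return {tier: difference} for tiers whose count differs from the reference."""
    expected = counts_by_tier[reference_tier]
    return {
        tier: count - expected
        for tier, count in counts_by_tier.items()
        if count != expected
    }


# Counts for one topic in one 10 minute bucket (made-up numbers).
bucket = {"producer": 1000, "tracking-local": 1000, "aggregate-site-a": 998}
print(audit_gaps(bucket))  # {'aggregate-site-a': -2}
```

A negative difference points at loss somewhere before that tier; a positive one points at duplication.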
In every message schema, we have a common header which provides, in part, the information needed to generate audit information. This includes a timestamp, set by the producer, specifying when the message was sent, as well as the application name and the hostname that the message originated at. This is utilized by the audit consumers when they read messages in order to count up messages by enough criteria to pinpoint a problem. The special audit messages themselves have start and end timestamps, which describe the 10 minute bucket they cover. They also have the topic name for which they apply, as well as a tier. The tier is a string which describes where this audit information came from. If it was sent in by a producer, the tier is always “producer”. This allows us to have a single tier that covers the production of all messages, since we have an environment where different services can produce the same type of message. The audit consumers use tier names that are specific to the Kafka cluster they are reading from, and consumers can specify their own tier name. Finally, there is a count of the number of messages in this bucket. Of course, the audit messages also have the common message header, so we can audit the audit if needed.
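As a rough picture of what an audit message carries, the sketch below models the fields described above and shows how message timestamps could be grouped into 10 minute buckets. The class and field names are assumptions made for illustration; the real schema is Avro and is not shown in these notes.

```python
from dataclasses import dataclass

BUCKET_SECONDS = 600  # 10 minute audit buckets


@dataclass(frozen=True)
class AuditMessage:
    topic: str
    tier: str
    start: int  # bucket start, seconds since the epoch
    end: int    # bucket end, seconds since the epoch
    count: int


def bucket_start(timestamp):
    """Round a timestamp down to the start of its 10 minute bucket."""
    return timestamp - timestamp % BUCKET_SECONDS


def build_audit(topic, tier, timestamps):
    """Count messages per bucket and emit one audit message for each bucket."""
    counts = {}
    for ts in timestamps:
        start = bucket_start(ts)
        counts[start] = counts.get(start, 0) + 1
    return [
        AuditMessage(topic, tier, start, start + BUCKET_SECONDS, count)
        for start, count in sorted(counts.items())
    ]


audits = build_audit("page-views", "producer", [0, 599, 600, 1199, 1200])
for audit in audits:
    print(audit)
```

Bucketing on the producer-set timestamp, rather than on arrival time, is what lets counts from different tiers be compared for the same window.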
We have a few concerns with our audit system. One that comes up fairly frequently is that we are only counting messages, we’re not considering the content of the message. This means that if we duplicate one message, and lose a different message, we still think we don’t have a problem. The reality of the situation is that this really doesn’t happen, at least not with any exactness, at the number of messages we are passing. Additionally, if we wanted to we could use the data we are storing to audit messages for a particular service, or a particular server, and get much more granular. Lastly, one of the largest consumers of our audited messages, Hadoop, performs additional checks on the messages to trim out duplicates and check the message content using other fields in the header. Another one that has come up recently is that we do not audit all consumers. Most consumers will just monitor their lag to make sure they are not falling behind, but they do not check to make sure they read every message that was produced. Hadoop is the exception because of the importance and variety of work that is done there – it uses an auditing consumer that writes back audit messages so we consider it another tier in audit. We found that the way our relational database is set up currently would, most likely, not be able to handle the amount of activity it would get if every consumer started using the auditing consumer. We’ve been working on changes to the data backend to support this. We also cannot properly audit complex message flows. For a given topic, each tier must have 100% of the messages in it. This means that all of our local tracking clusters have the same tier name, tracking-local, whereas each aggregate cluster has a site specific tier name that differentiates it from the other aggregate clusters. If there’s a problem in the local tier, we don’t immediately know which datacenter the problem is in without further investigation. 
We also have problems with topics that take different paths to get to aggregate, which can come in when you have special clusters for outside clients. What we’d like to do here is to have an audit infrastructure that has knowledge of the mirror maker layout, and the whitelists and blacklists that each mirror maker is configured for, so that we can more easily determine exactly where a problem is when it occurs. This is a longer term project that we’re only starting to plan out right now.
Specifically to address concerns with running multiple tiers, there are several things we are looking for improvements in. One of the biggest is access controls. Right now, we have no way to prevent clients from producing to the aggregate clusters, and this can generate big problems with how we access and audit data. In a more general sense, we also need the ACLs to make sure we know who is producing to what topics, and to secure any topics that should be limited access. This could include topics that have details like credit cards or health information. We also need to have encryption. Starting out, encryption of the data in motion is the most critical part. We have the luxury of working entirely within our own networks and backbones, but even at that we cannot be assured of the connections between our datacenters. We are moving towards first encrypting these communications, and then making sure every client connection is encrypted. Thankfully, both this and the ACLs are currently being worked on by the open source developers through a series of proposals and tickets that will address authentication, authorization, and TLS encryption. Later on, we will need to consider encryption of data at rest. This, however, can be handled entirely in the clients where it is needed. Another piece that we need is quotas. We have no way right now to prevent one bad actor from performing a denial of service, even unintentionally, against a cluster. This is a particular concern for us when we have Hadoop jobs either consuming or producing, as they can spin up many many clients to do their work. Right now, we mitigate this by having separate clusters for Hadoop to work in, but we want to collapse this as much as possible to avoid duplication of messages. It also creates yet another set of things that can fail. Using quotas, we can set limits on how much damage a client can do, and assure that the right application gets penalized if they cause a problem, and not everyone else. 
This is also being worked on in an open source proposal from LinkedIn. Lastly, we need improvements with the way mirror maker does decompression and recompression of message batches. This is pretty obvious – we want to avoid overhead work wherever possible. As of yet, we don’t have a good proposed solution for how to handle it that doesn’t involve trickery with the existing protocol definition. There have been some recent improvements with decompression in Kafka, both driven by us and by other developers, but more work is needed. It’s something we’re talking a lot about internally.
So how can you get more involved in the Kafka community? The most obvious answer is to go to kafka.apache.org. From there you can join the mailing lists, either on the development or the user side. You’ll find people on the #apache-kafka channel on Freenode IRC if you have questions. We also coordinate meetups for both Kafka and Samza in the Bay Area, with streaming if you are not local. You can also dive into the source repository, and work on and contribute your own tools back. Kafka may be young, but it’s a critical piece of data infrastructure for many of us.


Editor's Notes