This document discusses how the Kafka platform team at Shopify improved their Kafka monitoring and access control. They implemented certificate-based authentication to identify Kafka clients. They also logged all produce and consume requests to track what clients were doing. Finally, they configured ACLs to restrict which clients could access which topics, bringing more control and preventing incidents. This improved their on-call experiences and allowed the platform team to better manage their large Kafka deployment.
4. Kafka Platform @ Shopify
● Kafka is a business critical service
● Kafka clusters are deployed on Kubernetes.
● Several clusters across multiple Google Cloud Platform (GCP) regions.
5. Kafka Platform @ Shopify
Several apps and services use Kafka for a variety of use cases at Shopify:
1. Log ingestion pipelines
2. Change Data Capture
3. Stream processing using Apache Flink
4. Messaging platform for internal apps
And more…
6. How do we manage our topics and track ownership?
Aim: All topic management is performed only through this repository!
● Topics are defined in YAML files.
○ Related topics can be grouped into one file.
○ Topics may have one or more owners, assigned via GitHub code owners.
○ Most topics did not have owners ☹
● Users are free to configure their topics.
○ Partition counts.
○ Kafka topic configs, e.g. retention.ms
○ With some sane defaults and checks.
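A minimal sketch of what "sane defaults and checks" could look like once a topic's YAML entry has been loaded into a dict. The field names, default values, and limits here are assumptions for illustration; the talk does not show the real schema or thresholds.

```python
# Hypothetical defaults and limits; the real values are not given in the talk.
DEFAULTS = {"partitions": 3, "configs": {"retention.ms": 604_800_000}}  # 7 days
MAX_PARTITIONS = 100
MAX_RETENTION_MS = 30 * 24 * 60 * 60 * 1000  # 30 days


def resolve_topic(definition):
    """Merge a user-supplied topic definition with defaults, then validate it."""
    name = definition.get("name")
    if not name:
        raise ValueError("topic needs a name")

    partitions = definition.get("partitions", DEFAULTS["partitions"])
    # User-supplied configs override the defaults key by key.
    configs = {**DEFAULTS["configs"], **definition.get("configs", {})}

    if not 1 <= partitions <= MAX_PARTITIONS:
        raise ValueError(f"{name}: partitions must be in 1..{MAX_PARTITIONS}")
    if int(configs["retention.ms"]) > MAX_RETENTION_MS:
        raise ValueError(f"{name}: retention.ms exceeds {MAX_RETENTION_MS}")

    return {"name": name, "partitions": partitions, "configs": configs}
```

Running a check like this in CI on the topics repository would reject a bad definition before it ever reaches a cluster.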
7. How are apps deployed within Shopify?
● At Shopify, most apps and services are deployed to Kubernetes.
● Each app runs in its own namespace, providing some degree of isolation from other apps.
● A K8s controller (an app extending K8s) provides 2 things per namespace:
○ A certificate to connect to Kafka.
○ A config with information on how to connect to the relevant cluster.
● These certificates all have the same identity.
8. Kafka Platform @ Shopify
How are apps deployed within Shopify?
Most apps are registered with an internal service registry
● This registry maps each app to a team
● It also records which namespace each app is deployed to, etc.
9. Some fun Kafka stats
710 Brokers
14 Clusters
1PB+ Avg. Disk Usage
8M+ Avg. Msgs/Sec
8K+ Active Topics
200+ Kafka Clients
12. What on-call was like
The same old alerts, often masking one of many possible causes:
● Someone decided to run a mini scale test.
● Clients directly accessing production clusters to modify topic configs and bypass schema management.
● A topic has a sudden increase in traffic.
○ Internal clients can, and have, started to DDoS the brokers those partitions live on.
○ It was hard to make good use of quotas as all clients had the same identity.
13. What on-call was like
The same old alerts, often masking one of many possible causes:
● A topic is suddenly produced to months after being deleted.
● Someone decided to run a backfill.
○ Often starting Friday evenings for single partition topics :)
● And so much more…
And this starts a process of us diving into our many, many codebases trying to find the team responsible. In such situations, context is everything! And we had none.
14. 1. Who is connecting to Kafka?
2. What are they doing?
3. How can we have control over it?
23. Controller 2: kafka-buddy
Responsible for:
● Watching for changes in k8s Namespace objects
● Updating each Namespace with a Certificate
● Setting Certificate subject to be the Namespace (application’s name)
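Once the certificate subject carries the namespace, a principal seen on the broker can be traced back to a team. The sketch below assumes a simple in-memory stand-in for the internal service registry (its contents and the helper names are hypothetical) and a naive parser that does not handle escaped commas in a distinguished name.

```python
# Hypothetical stand-in for the internal service registry.
SERVICE_REGISTRY = {
    "checkout": {"team": "payments", "namespace": "checkout"},
}


def certificate_subject(namespace):
    """Subject the controller would set: the namespace as the Common Name."""
    return f"CN={namespace}"


def namespace_from_principal(principal):
    """Extract the CN from a principal such as 'User:CN=checkout,OU=apps'."""
    dn = principal.removeprefix("User:")
    for part in dn.split(","):
        key, _, value = part.strip().partition("=")
        if key.upper() == "CN":
            return value
    return None


def owner_of(principal):
    """Look up the owning team for a principal, or None if unregistered."""
    entry = SERVICE_REGISTRY.get(namespace_from_principal(principal))
    return entry["team"] if entry else None
```

With unique identities per namespace, an alert that names a principal can be routed straight to a team instead of starting a codebase search.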
31. Performance Considerations
Required: No discernible impact on disk, CPU, memory, and latency
Kafka Topic Log Compaction
● Produce keyed log events:
<principal>|<operation>|<topic>
Key: CN=dhruv|READ|topic-1
Value: { … }
Key: CN=justin|WRITE|topic-2
Value: { … }
Authorizer Caching
● Cache keys and only produce new logs for new keys
● Fast cache reads + infrequent cache writes
● Fixed cache duration from cache insertion time, last access time, etc.
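The caching idea above can be sketched as follows. This is an illustration only, not the authorizer code from the talk: the class and method names are made up, expiry is measured from insertion time (one of the options the slide lists), and expired entries are overwritten rather than evicted.

```python
import time


class AuditLogCache:
    """Remember recently logged keys so each one is produced at most once per TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so tests can control time
        self._expires_at = {}        # key -> time after which we log again

    def key_to_log(self, principal, operation, topic):
        """Return the log key if an event should be produced, else None."""
        key = f"{principal}|{operation}|{topic}"
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and now < expiry:
            return None              # cache hit: seen recently, skip producing
        self._expires_at[key] = now + self._ttl
        return key                   # cache miss: caller produces a keyed event
```

Because the events are keyed by `<principal>|<operation>|<topic>`, log compaction on the destination topic keeps only the latest value per key, so a repeat produce after the TTL replaces the old record instead of growing the topic.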
36. What are ACLs and why do we care about them?
TL;DR: ACLs (access control lists) are a way to restrict access to certain operations for certain users.
38. What are ACLs and why do we care about them?
1. Any Kafka client had unrestricted permission to modify Kafka resources.
2. Users could bypass schema management (Shopify/schemas).
3. We had no logs of these operations (more on this later!).
While we have a high level of trust internally, this still left our clusters open to unplanned operations and human errors.
39. The “numbers” slide 😅 once again
710 Brokers
14 Clusters
1PB+ Avg. Disk Usage
8M+ Avg. Msgs/Sec
8K+ Active Topics
200+ Kafka Clients
41. New platform goals for the Kafka Users
Kafka users should only be able to:
1. Consume/produce to their topics. 🍞🧈
2. Perform basic operations, e.g. resetting consumer groups.
Users should absolutely not be able to manage topics themselves!
42. New platform goals for the Kafka Platform
The Kafka platform:
1. Better observability/logging when clients perform a restricted operation.
2. Internal apps should still have permission to perform necessary operations.
a. Cruise Control can still perform rebalances.
b. Only the topic management services can manage topics, etc…
3. Use quotas: We can find the principal using the producer/consumer metrics.
43. What do ACLs look like and how do we manage them?
And where even are these ACLs stored?
44. What do ACLs look like and how do we manage them?
Not fun to read or parse ^
Don’t even try, save your eyes
45. What do ACLs look like and how do we manage them?
Translation:
The user/principal kafka-cruise-control (an internal app)
is allowed to
perform the operations create and alter
on the resource topic
which matches the pattern __KafkaCruiseControl
Still not fun to read or parse ^
Here the “Principal” is the Common Name (CN) of an SSL certificate.
● All ACLs are applied to certificates.
● This makes matching clients to their certificates important (coming soon!)
● <- Relevant part of an SSL cert
46. What do ACLs look like and how do we manage them?
Translation:
The user/principal kafka-cruise-control
is allowed to
perform the operations create and alter
on the resource topic
which matches the pattern …
MUCH BETTER… ->
47. What do ACLs look like and how do we manage them?
● ACLs most users would care about.
● The rest are for internal apps that help
manage Kafka
48. What do ACLs look like and how do we manage them?
We wrote a script to parse existing ACLs and use the YAML file to define them
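As a rough illustration of the idea, the sketch below takes one ACL entry as it might look after loading a YAML file and turns it into the readable "translation" form, plus the per-operation entries a broker would need. The field names are a hypothetical schema; the talk's slides do not show the real YAML layout in text form.

```python
# Hypothetical shape of one ACL entry after loading the YAML file.
ACL = {
    "principal": "kafka-cruise-control",
    "permission": "allow",
    "operations": ["create", "alter"],
    "resource_type": "topic",
    "pattern": "__KafkaCruiseControl",
}


def describe(acl):
    """Render an ACL entry as a human-readable sentence."""
    verb = "allowed" if acl["permission"] == "allow" else "not allowed"
    operations = " and ".join(acl["operations"])
    return (
        f"The principal {acl['principal']} is {verb} to perform the operations "
        f"{operations} on the resource {acl['resource_type']} "
        f"which matches the pattern {acl['pattern']}"
    )


def expand(acl):
    """Split one entry into one (principal, operation, ...) tuple per operation."""
    return [
        (acl["principal"], op, acl["resource_type"], acl["pattern"], acl["permission"])
        for op in acl["operations"]
    ]
```

Grouping several operations under one entry keeps the file short for reviewers, while `expand` produces the one-operation-per-rule form that Kafka ACL bindings use.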
55. So why do we not just log everything? Well… we tried :)
Service Disruption #1902: Elevated errors on kafka cluster XYZ after deployment
● The dashboard we just saw is only populated for staging environments.
● Streaming ~7 minutes' worth of logs to inspect them produced 600+ MiB…
● We care a lot more about “Deny” operations than approved ones.
56. Some ACL Gotchas
● ACL operations have precedence
○ “Deny” takes precedence over “Allow”.
○ With multiple “Allow” rules, the most permissive one is used.
○ Sane defaults, but these don’t allow us to have “exclusive” per-resource rules:
■ Ex. We can’t have the behavior: all principals can read all topics except for “topic_”, which can only be produced to by principal X.
■ Even Kafka ACLs expect rules to be defined at the per-topic level.
○ We can once again modify the Authorizer class to support this.
● If no ACLs have been applied for a resource (ex. topic), creating a single rule will enable ACLs for it.
○ If you want to apply ACLs with no downtime or incidents, consider adding rules such as break-glass access first.
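The two gotchas above can be seen in a simplified model of ACL evaluation. This is not Kafka's authorizer: it only matches literal topic names, uses "*" as a stand-in wildcard principal, and assumes the broker setting `allow.everyone.if.no.acl.found=true`, which is what makes a resource with no ACLs open to everyone.

```python
def is_authorized(acls, principal, operation, topic):
    """Simplified check: Deny beats Allow; no ACLs on a topic means open access."""
    on_topic = [a for a in acls if a["topic"] == topic]
    if not on_topic:
        # Assumes allow.everyone.if.no.acl.found=true on the brokers.
        return True

    relevant = [
        a for a in on_topic
        if a["principal"] in (principal, "*")
        and a["operation"] in (operation, "ALL")
    ]
    if any(a["permission"] == "DENY" for a in relevant):
        return False
    # Once any ACL exists on the topic, access requires an explicit Allow.
    return any(a["permission"] == "ALLOW" for a in relevant)
```

For example, adding a single Allow rule for one producer on a topic immediately locks out every consumer that has no rule of its own, which is why broad rules are worth putting in place before the first specific one.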
57. Now we have the tools to answer all these questions:
1. Who is connecting to Kafka?
2. What are they doing?
3. How can we have control over it?
On-call is no longer dreaded. Our team has more bandwidth to do other work.
58. You can do it too! 🫵
We had K8s namespaces and a service registry tool, what might you have?
LDAP, Active Directory, EC2/VM instance names, Environment Variables, Certificate CNs.
Get ahead of the problem! Secure your platform and start enforcing topic ownership today!
From ACLs to Quotas, Kafka is by default designed for apps to have unique identities!