Knock Knock, Who’s There?
Identifying Kafka Clients in a Multi-tenant Environment
&
Securing Your Clusters Using Kafka ACLs
Justin Chen & Dhruv Jauhar
Kafka Platform @ Shopify
● Kafka is a business-critical service.
● Kafka clusters are deployed on Kubernetes.
● Several clusters run across multiple Google Cloud Platform (GCP) regions.
Kafka Platform @ Shopify
Several apps and services use Kafka for a variety of use cases at Shopify:
1. Log ingestion pipelines
2. Change Data Capture
3. Stream processing using Apache Flink
4. Messaging platform for internal apps
And more…
How do we manage our topics and track ownership?
Aim: All topic management is performed only through this repository!
● Topics are defined in YAML files.
○ Related topics can be grouped into one file.
○ Topics may have one or more owners, declared using GitHub code owners.
○ Most topics did not have owners ☹
● Users are free to configure their topics.
○ Partition counts.
○ Kafka topic configs, e.g. retention.ms
○ With some sane defaults and checks.
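A minimal sketch of what "sane defaults and checks" could look like once a topic definition has been loaded from its YAML file into a dictionary. The field names (`name`, `owners`, `partitions`, `configs`), the default values, and the partition limit are all assumptions made for illustration; the slides do not show the real schema or limits.

```python
# Hypothetical defaults and limits; the slides do not list the real ones.
DEFAULTS = {"partitions": 3, "configs": {"retention.ms": 604_800_000}}  # 7 days
MAX_PARTITIONS = 100


def resolve_topic(definition: dict) -> dict:
    """Merge a user-supplied topic definition with defaults and validate it."""
    name = definition.get("name")
    if not name:
        raise ValueError("a topic definition needs a name")

    partitions = definition.get("partitions", DEFAULTS["partitions"])
    if not 1 <= partitions <= MAX_PARTITIONS:
        raise ValueError(f"{name}: partitions must be between 1 and {MAX_PARTITIONS}")

    # User-supplied configs override the defaults key by key.
    configs = {**DEFAULTS["configs"], **definition.get("configs", {})}
    if int(configs["retention.ms"]) < -1:  # -1 means unlimited retention in Kafka
        raise ValueError(f"{name}: retention.ms must be -1 or greater")

    return {
        "name": name,
        "owners": list(definition.get("owners", [])),
        "partitions": partitions,
        "configs": configs,
    }
```

Running a check like this in CI on the topic repository would reject a bad definition before it ever reaches a cluster.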
How are apps deployed within Shopify?
● At Shopify, most apps and services are deployed to Kubernetes.
● Each app runs in its own namespace, providing some degree of isolation from other apps.
● A K8s controller (an app extending K8s) provides access to 2 things per namespace:
○ A certificate to connect to Kafka
○ A config with information on how to connect to the relevant cluster.
○ These certificates all have the same identity
How are apps deployed within Shopify?
Most apps are registered with an internal service registry.
● This registry maps each app to a team.
● It also records which namespace each app is deployed to, etc.
Some fun Kafka stats
● 710 brokers
● 14 clusters
● 1PB+ avg. disk usage
● 8M+ avg. msgs/sec
● 8K+ active topics
● 200+ Kafka clients
What motivated this talk?
Our on-call experience 😅
What on-call was like
The same old alerts, often masking one of many underlying causes
● Someone decided to run a mini scale test.
● Clients directly accessing production clusters to modify topic configs and bypass schema management.
● A topic has a sudden increase in traffic
○ Internal clients can and have started to DDoS the brokers those partitions live on.
○ It was hard to make good use of quotas as all clients had the same identity.
What on-call was like
The same old alerts, often masking one of many underlying causes
● A topic is suddenly produced to months after being deleted.
● Someone decided to run a backfill.
○ Often starting Friday evenings for single partition topics :)
● And so much more…
And this kicks off a process of diving into our many, many codebases to find the team responsible. In such situations, context is everything! And we had none.
1. Who is connecting to Kafka?
2. What are they doing?
3. How can we have control over it?
1. Who is connecting to Kafka?
Identifying Users: SSL Authentication
Certificate:
Data:
Version: 3 (0x2)
Serial Number:
...
Signature Algorithm: ecdsa-with-SHA256
Issuer: C=CA, ST=Ontario, L=Ottawa, O=Shopify, CN=Shopify Kafka
Validity
Not Before: Sep 24 00:23:24 2022 GMT
Not After : Sep 25 00:23:53 2022 GMT
Subject: CN=kafka@shopify.com, O=Shopify
Subject Public Key Info:
Public Key Algorithm: rsaEncryption
Public-Key: (2048 bit)
Modulus:
...
Previously…
Certificate Management Flow
[Diagram: a K8s cluster with a ClusterIssuer; an application namespace (my-app-1) containing Pods, a Secret, and a Certificate; and a controllers namespace (my-controllers) running cert-manager and kafka-buddy.]
Kubernetes Controllers
[Diagram: controllers watch Kubernetes objects (Pods, Services, CRDs, …) and update them.]
Kubernetes Controllers
● cert-manager (OSS)
● kafka-buddy
Controller 1: cert-manager
cert-manager K8s Resources
Controller 2: kafka-buddy
Responsible for:
● Watching for changes in K8s Namespace objects
● Updating each Namespace with a Certificate
● Setting the Certificate subject to be the Namespace (the application’s name)
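As an illustration of the last two responsibilities, the sketch below builds a cert-manager `Certificate` manifest whose common name is the namespace name. The `apiVersion`, `kind`, and `spec` fields follow cert-manager's Certificate resource; the object name, secret name, issuer name, organization, and one-day duration are assumptions chosen for the example, not details taken from kafka-buddy itself.

```python
def certificate_for_namespace(namespace: str, issuer: str = "kafka-issuer") -> dict:
    """Build a cert-manager Certificate manifest whose CN is the namespace name.

    The resource name, secret name, and issuer name are hypothetical.
    """
    return {
        "apiVersion": "cert-manager.io/v1",
        "kind": "Certificate",
        "metadata": {"name": "kafka-client", "namespace": namespace},
        "spec": {
            # cert-manager stores the issued key pair in this Secret.
            "secretName": "kafka-client-tls",
            # The identity Kafka will see: one distinct CN per namespace.
            "commonName": namespace,
            "subject": {"organizations": ["Shopify"]},
            # Assumed short lifetime, so certificates are rotated often.
            "duration": "24h",
            "usages": ["client auth"],
            "issuerRef": {"name": issuer, "kind": "ClusterIssuer"},
        },
    }
```

A controller would create or update this object each time it sees a Namespace event, and cert-manager would then issue the certificate into the named Secret.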
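All together…
[Diagram: a K8s cluster with a ClusterIssuer, an application namespace (my-app-1) containing Pods, a Secret, and a Certificate, and a controllers namespace (my-controllers) running cert-manager and kafka-buddy, with the flow between them numbered 1 to 6.]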
2. What are they doing?
Kafka’s AclAuthorizer
[Diagram: my-app sends a Produce/Consume request to the broker, where the AclAuthorizer runs an ACL query.]
Principal = User:CN=my-app, O=Shopify is Allowed Operation = READ on resource = Topic:my-topic
Principal = User:CN=my-app, O=Shopify is Allowed Operation = WRITE on resource = Topic:my-topic
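A small sketch of how an authorizer log entry of this shape could be turned into structured fields. The regular expression targets the abbreviated form shown on the slide (principal, decision, operation, resource) written on a single line; real broker log lines carry more fields, so treat the pattern as an assumption to adapt rather than a parser for actual Kafka output.

```python
import re
from typing import Optional

# Matches the abbreviated entry shown on the slide, written on one line.
LINE = re.compile(
    r"Principal = (?P<principal>User:.+?) is (?P<decision>Allowed|Denied) "
    r"Operation = (?P<operation>\w+) on resource = "
    r"(?P<resource_type>\w+):(?P<resource_name>\S+)"
)


def parse_authorizer_line(line: str) -> Optional[dict]:
    """Return the fields of one authorizer log entry, or None if it does not match."""
    match = LINE.search(line)
    return match.groupdict() if match else None


entry = parse_authorizer_line(
    "Principal = User:CN=my-app, O=Shopify is Allowed "
    "Operation = READ on resource = Topic:my-topic"
)
```

Here `entry["principal"]` is `User:CN=my-app, O=Shopify` and `entry["resource_name"]` is `my-topic`.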
DEBUG Level Difficulties
Custom Kafka Authorizer
Gathering Produce + Consume Logs
[Diagram: my-app sends a Produce/Consume request to the broker, where MyAuthorizer writes an event to one of two log topics.]
Topic: _producer_log
{
"principal": "CN=my-app",
"operation": "WRITE",
"topic": "topic-2"
}
Topic: _consumer_log
{
"principal": "CN=my-app",
"operation": "READ",
"topic": "topic-1"
}
Performance Considerations
Required: No discernible impact on disk, CPU, memory, and latency
Kafka Topic Log Compaction
● Produce keyed log events:
<principal>|<operation>|<topic>
Key: CN=dhruv|READ|topic-1
Value: { … }
Key: CN=justin|WRITE|topic-2
Value: { … }
Authorizer Caching
● Cache keys and only produce new logs for new keys
● Fast cache reads + infrequent cache writes
● Fixed cache duration from cache insertion time, last access time, etc.
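The combination of keyed log events and caching can be sketched as follows. This is an illustration of the idea in Python, not the authorizer itself (a real Kafka authorizer is a JVM plugin): the class name, the `send` callback, the one-hour default, and expiry measured from insertion time are all assumptions. The sketch is also not thread-safe.

```python
import time
from typing import Callable, Dict


class AccessLogDeduper:
    """Emit one keyed log event per (principal, operation, topic) per cache window."""

    def __init__(
        self,
        send: Callable[[str, str, dict], None],  # (log_topic, key, value)
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send = send
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen: Dict[str, float] = {}  # key -> time the key was cached

    def record(self, principal: str, operation: str, topic: str) -> bool:
        """Return True if an event was produced, False on a cache hit."""
        key = f"{principal}|{operation}|{topic}"
        now = self._clock()
        cached_at = self._seen.get(key)
        if cached_at is not None and now - cached_at < self._ttl:
            return False  # fast path: key already logged recently
        self._seen[key] = now
        log_topic = "_producer_log" if operation == "WRITE" else "_consumer_log"
        value = {"principal": principal, "operation": operation, "topic": topic}
        self._send(log_topic, key, value)
        return True
```

Because the key is stable, a compacted log topic only retains the latest value per key, so re-producing a key after the cache entry expires does not make the topic grow without bound.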
Leveraging Producer + Consumer Data
Key: CN=team-a-app|WRITE|topic_1
Value: { … }
Key: CN=team-b-app|READ|topic_1
Value: { … }
Key: CN=team-a-app|WRITE|topic_2
Value: { … }
_producer_log _consumer_log
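One way to use these keys, sketched under the assumption that they have already been read out of the two log topics into a list: group them by topic to see which principals produce to and consume from each one. The function name and the output shape are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def summarize_usage(keys: Iterable[str]) -> Dict[str, Dict[str, Set[str]]]:
    """Group <principal>|<operation>|<topic> keys into producers and consumers per topic."""
    usage: Dict[str, Dict[str, Set[str]]] = defaultdict(
        lambda: {"producers": set(), "consumers": set()}
    )
    for key in keys:
        # Split from the right so a "|" inside the principal would not break parsing.
        principal, operation, topic = key.rsplit("|", 2)
        role = "producers" if operation == "WRITE" else "consumers"
        usage[topic][role].add(principal)
    return dict(usage)


usage = summarize_usage(
    [
        "CN=team-a-app|WRITE|topic_1",
        "CN=team-b-app|READ|topic_1",
        "CN=team-a-app|WRITE|topic_2",
    ]
)
```

Joining the principals against a service registry would then map each topic to the teams that depend on it.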
Now we know:
1. Who is connecting to Kafka?
2. What are they doing?
Answering the remaining question:
3. How can we have control over it?
What are ACLs and why do we care about them?
TL;DR: ACLs (access control lists) are a way to restrict access to certain operations for certain users
What are ACLs and why do we care about them?
1. Any Kafka client had unrestricted permission to modify Kafka resources.
2. Users could bypass schema management (Shopify/schemas).
3. We had no logs of these operations (more on this later!).
While we have a high level of trust internally, this still left our clusters open to unplanned operations and human errors.
The “numbers” slide 😅 once again
● 710 brokers
● 14 clusters
● 1PB+ avg. disk usage
● 8M+ avg. msgs/sec
● 8K+ active topics
● 200+ Kafka clients
Sooo… if something goes wrong, the impact can be HUGE 🔥
New platform goals for the Kafka Users
Kafka users should only be able to:
1. Consume from/produce to their topics. 🍞🧈
2. Perform basic operations, e.g. resetting consumer groups.
Users should absolutely not be able to manage topics themselves!
New platform goals for the Kafka Platform
The Kafka platform:
1. Better observability/logging when clients perform a restricted operation.
2. Internal apps should still have permission to perform necessary operations.
a. Cruise Control can still perform rebalances.
b. Only the topic management services can manage topics, etc…
3. Use quotas: We can find the principal using the producer/consumer metrics.
What do ACLs look like and how do we manage them?
And where even are these ACLs stored?
What do ACLs look like and how do we manage them?
Not fun to read or parse ^
Don’t even try, save your eyes
What do ACLs look like and how do we manage them?
Translation:
The user/principal kafka-cruise-control (an internal app)
is allowed to
perform the operations create and alter
on the resource topic
which matches the pattern __KafkaCruiseControl
Still not fun to read or parse ^
Here the “Principal” is the Common Name (CN) of an SSL certificate.
● All ACLs are applied to certificates.
● This makes matching clients to their certificates important (coming soon!)
[Image: the relevant part of an SSL certificate]
What do ACLs look like and how do we manage them?
Translation:
The user/principal kafka-cruise-control
is allowed to
perform the operations create and alter
on the resource topic
which matches the pattern …
MUCH BETTER… ->
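To make the translation concrete, here is a sketch that renders a structured ACL entry as the sentence used on these slides. The dictionary field names (`principal`, `permission`, `operations`, `resource_type`, `pattern`) are assumptions for illustration; the actual YAML schema from the talk is not reproduced in this transcript.

```python
def describe_acl(acl: dict) -> str:
    """Render one ACL entry as a plain-English sentence."""
    verdict = "allowed" if acl["permission"] == "ALLOW" else "not allowed"
    operations = " and ".join(op.lower() for op in acl["operations"])
    return (
        f"The principal {acl['principal']} is {verdict} to "
        f"perform the operations {operations} "
        f"on the resource {acl['resource_type'].lower()} "
        f"which matches the pattern {acl['pattern']}"
    )


sentence = describe_acl(
    {
        "principal": "kafka-cruise-control",
        "permission": "ALLOW",
        "operations": ["CREATE", "ALTER"],
        "resource_type": "TOPIC",
        "pattern": "__KafkaCruiseControl",
    }
)
```

Keeping ACLs in a structured form like this is what makes them reviewable in a pull request, in the same way as the topic definitions.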
What do ACLs look like and how do we manage them?
● ACLs most users would care about.
● The rest are for internal apps that help
manage Kafka
What do ACLs look like and how do we manage them?
We wrote a script to parse existing ACLs and use the YAML file to define them
Better observability of ACLs
What the Splunk logs for this alert look like ^
So why do we not just log everything?
Why did we extend the authorizer class to collect the same data?
Well… we tried :)
DEBUG Level Difficulties
So why do we not just log everything? Well… we tried :)
Service Disruption #1902: Elevated errors on kafka cluster XYZ after deployment
● The dashboard we just saw is only populated for staging environments.
● Streaming ~7 min worth of logs to inspect them produced 600+ MiB…
● We care a lot more about “Deny” operations than approved ones.
Some ACL Gotchas
● ACL operations have precedence
○ “Deny” takes precedence over “Allow”.
○ With multiple “Allow” rules, the most permissive one applies.
○ These are sane defaults, but they don’t allow us to have “exclusive” per-resource rules:
■ Ex. We can’t have the behavior: all principals can read all topics except for “topic_”, which can only be produced to by principal X.
■ Even Kafka ACLs expect rules to be defined at the per-topic level.
○ We can once again modify the Authorizer class to support this.
● If no ACLs have been applied for a resource (ex. topic), creating a single rule will enable ACLs for it.
○ If you want to apply ACLs with no downtime or incidents, consider adding break-glass rules first.
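The precedence rules above can be modelled in a few lines. This is a simplified sketch, not Kafka's implementation: it ignores hosts, prefixed resource patterns, super users, and implied operations, and it assumes the broker is configured to allow everyone when a resource has no ACLs (the `allow.everyone.if.no.acl.found` setting), which is the behavior the last bullet describes. The `Acl` class and function name are illustrative.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Acl:
    principal: str   # e.g. "User:CN=my-app"; "User:*" matches every principal
    operation: str   # e.g. "READ", "WRITE"; "ALL" matches every operation
    permission: str  # "ALLOW" or "DENY"


def is_authorized(
    principal: str,
    operation: str,
    acls_for_resource: Sequence[Acl],
    allow_if_no_acls: bool = True,
) -> bool:
    """Decide access to one resource using deny-over-allow precedence."""
    if not acls_for_resource:
        # No rule exists for this resource yet, so the fallback setting applies.
        return allow_if_no_acls
    matching = [
        acl
        for acl in acls_for_resource
        if acl.principal in (principal, "User:*")
        and acl.operation in (operation, "ALL")
    ]
    if any(acl.permission == "DENY" for acl in matching):
        return False  # a single matching deny wins
    return any(acl.permission == "ALLOW" for acl in matching)
```

The gotcha is visible in the model: as soon as one ACL exists for a topic, every principal without a matching allow rule is rejected, which is why a broad allow rule added first can act as a break-glass safeguard during rollout.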
Now we have the tools to answer all these questions:
1. Who is connecting to Kafka?
2. What are they doing?
3. How can we have control over it?
On-call is no longer dreaded. Our team has more bandwidth to do other work.
You can do it too! 🫵
We had K8s namespaces and a service registry tool, what might you have?
LDAP, Active Directory, EC2/VM instance names, Environment Variables, Certificate CNs.
Get ahead of the problem! Secure your platform and start enforcing topic ownership today!
From ACLs to Quotas, Kafka is by default designed for apps to have unique identities!
Thank you!
Any questions? 😃

 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 

Securing Kafka Clusters Using ACLs and Identifying Kafka Clients

  • 1. Knock Knock, Who’s There? Identifying Kafka Clients in a Multi-tenant Environment Justin Chen & Dhruv Jauhar
  • 2. Knock Knock, Who’s There? Identifying Kafka Clients in a Multi-tenant Environment & Securing Your Clusters Using Kafka ACLs Justin Chen & Dhruv Jauhar
  • 4. Kafka Platform @ Shopify ● Kafka is a business critical service ● Kafka clusters are deployed on Kubernetes. ● Several clusters across multiple Google Cloud Platform (GCP) regions.
  • 5. Kafka Platform @ Shopify Several apps and services use Kafka for a variety of use cases at Shopify: 1. Log ingestion pipelines 2. Change Data Capture 3. Stream processing using Apache Flink 4. Messaging platform for internal apps, And more…
  • 6. How do we manage our topics and track ownership? Aim: All topic management is performed only through this repository! ● Topics are defined in YAML files. ○ Related topics can be grouped into one file. ○ Topics may have one or more owners, set using GitHub code owners. ○ Most topics did not have owners ☹ ● Users are free to configure their topics. ○ Partition counts. ○ Kafka topic configs, e.g. retention.ms ○ With some sane defaults and checks.
  • 7. How are apps deployed within Shopify? ● At Shopify, most apps and services are deployed to Kubernetes. ● Each app runs in its own namespace, providing some degree of isolation from other apps. ● A K8s controller (an app extending K8s) provides 2 things per namespace: ○ A certificate to connect to Kafka. ○ A config with information on how to connect to the relevant cluster. ○ These certificates all have the same identity.
  • 8. Kafka Platform @ Shopify How are apps deployed within Shopify? Most apps are registered with an internal service registry. ● This registry maps each app to a team. ● It also records which namespace the app is deployed to, etc.
  • 9. Some fun Kafka stats 710 Brokers 14 Clusters 1PB+ Avg. Disk Usage 8M+ Avg. Msgs/Sec 8K+ Active Topics 200+ Kafka Clients
  • 10. What motivated this talk? Our on-call experience 😅
  • 12. What on-call was like Same old alerts, often masking one of many possible causes ● Someone decided to run a mini scale test. ● Clients directly accessing production clusters to modify topic configs and bypass schema management. ● A topic has a sudden increase in traffic. ○ Internal clients can DDoS, and have DDoSed, the brokers those partitions live on. ○ It was hard to make good use of quotas as all clients had the same identity.
  • 13. What on-call was like Same old alerts, often masking one of many possible causes ● A topic is suddenly produced to months after being deleted. ● Someone decided to run a backfill. ○ Often starting Friday evenings for single-partition topics :) ● And so much more… And this starts a process of us diving into our many, many codebases trying to find the team responsible. In such situations, context is everything! And we had none.
  • 14. 1. Who is connecting to Kafka? 2. What are they doing? 3. How can we have control over it?
  • 15. 1. Who is connecting to Kafka?
  • 16. Identifying Users: SSL Authentication Certificate: Data: Version: 3 (0x2) Serial Number: ... Signature Algorithm: ecdsa-with-SHA256 Issuer: C=CA, ST=Ontario, L=Ottawa, O=Shopify, CN=Shopify Kafka Validity Not Before: Sep 24 00:23:24 2022 GMT Not After : Sep 25 00:23:53 2022 GMT Subject: CN=kafka@shopify.com, O=Shopify Subject Public Key Info: Public Key Algorithm: rsaEncryption Public-Key: (2048 bit) Modulus: ...
  • 18. Certificate Management Flow K8s cluster ClusterIssuer Namespace (my-app-1) Pod Pod Pods Secret Certificate Namespace (my-controllers) cert-manager kafka-buddy
  • 23. Controller 2: kafka-buddy Responsible for: ● Watching for changes in k8s Namespace objects ● Updating each Namespace with a Certificate ● Setting Certificate subject to be the Namespace (application’s name)
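The per-namespace certificate described above can be sketched as a function that builds a cert-manager Certificate manifest with the namespace as its common name. This is only an illustration of the idea, not Shopify's controller: the resource name `kafka-client`, the secret name `kafka-client-tls`, and the issuer name `kafka-cluster-issuer` are hypothetical placeholders.

```python
def certificate_for_namespace(namespace: str,
                              issuer: str = "kafka-cluster-issuer") -> dict:
    """Build a cert-manager Certificate manifest identifying one namespace.

    All names except the namespace itself are hypothetical placeholders.
    """
    return {
        "apiVersion": "cert-manager.io/v1",
        "kind": "Certificate",
        "metadata": {"name": "kafka-client", "namespace": namespace},
        "spec": {
            # Secret that cert-manager fills with the issued key pair.
            "secretName": "kafka-client-tls",
            # The namespace (application name) becomes the certificate CN,
            # so each app presents a distinct identity to Kafka.
            "commonName": namespace,
            "subject": {"organizations": ["Shopify"]},
            "issuerRef": {"name": issuer, "kind": "ClusterIssuer"},
        },
    }


def missing_certificates(namespaces, namespaces_with_cert):
    """Return manifests for namespaces that do not have a certificate yet."""
    have = set(namespaces_with_cert)
    return [certificate_for_namespace(ns) for ns in namespaces if ns not in have]
```

A controller loop would call something like `missing_certificates` whenever it observes a Namespace change and apply the returned manifests to the cluster.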
  • 24. All together… K8s cluster ClusterIssuer Namespace (my-app-1) Pod Pod Pods Secret Certificate Namespace (my-controllers) cert-manager kafka-buddy 1 2 3 4 5 6
  • 25. 2. What are they doing?
  • 26. Kafka’s AclAuthorizer AclAuthorizer my-app Principal = User:CN=my-app, O=Shopify is Allowed Operation = READ on resource = Topic:my-topic Principal = User:CN=my-app, O=Shopify is Allowed Operation = WRITE on resource = Topic:my-topic Produce/Consume Request ACL query
  • 30. Gathering Produce + Consume Logs Produce/Consume Request Topic: _consumer_log Topic: _producer_log my-app MyAuthorizer { "principal": "CN=my-app", "operation": "WRITE", "topic": "topic-2" } { "principal": "CN=my-app", "operation": "READ", "topic": "topic-1" }
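A rough sketch of the logging step on this slide, written in Python for brevity (a real authorizer runs inside the broker on the JVM). It assumes the principal arrives as a string such as `User:CN=my-app, O=Shopify`, and that WRITE events go to `_producer_log` and READ events to `_consumer_log`, as the slide suggests. The function names are hypothetical, and the naive comma split does not handle escaped commas in a distinguished name.

```python
import json

# Which log topic receives which operation, as drawn on the slide.
LOG_TOPIC_BY_OPERATION = {"WRITE": "_producer_log", "READ": "_consumer_log"}


def common_name(principal: str) -> str:
    """Extract the 'CN=...' part from a principal like 'User:CN=my-app, O=Shopify'."""
    if principal.startswith("User:"):
        principal = principal[len("User:"):]
    for part in principal.split(","):
        part = part.strip()
        if part.upper().startswith("CN="):
            return part
    raise ValueError("principal has no CN component: " + principal)


def access_log_event(principal: str, operation: str, topic: str):
    """Return (log_topic, json_payload), or None for operations we do not log."""
    log_topic = LOG_TOPIC_BY_OPERATION.get(operation)
    if log_topic is None:
        return None
    event = {
        "principal": common_name(principal),
        "operation": operation,
        "topic": topic,
    }
    return log_topic, json.dumps(event)
```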
  • 31. Performance Considerations Required: No discernible impact on disk, CPU, memory, and latency Kafka Topic Log Compaction ● Produce keyed log events: <principal>|<operation>|<topic> Key: CN=dhruv|READ|topic-1 Value: { … } Key: CN=justin|WRITE|topic-2 Value: { … } Authorizer Caching ● Cache keys and only produce new logs for new keys ● Fast cache reads + infrequent cache writes ● Fixed cache duration from cache insertion time, last access time, etc.
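The caching idea above can be sketched as a small cache that returns a key only the first time it is seen within a fixed window measured from insertion time. This is an illustrative model, not the talk's implementation: the class name, the TTL value, and the injectable clock are assumptions.

```python
import time


class SeenKeyCache:
    """Remember recently logged keys so repeated requests produce no new log."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the expiry can be tested
        self._expires_at = {}        # key -> time at which the entry expires

    def key_to_log(self, principal: str, operation: str, topic: str):
        """Return the compaction key if it should be produced, else None."""
        key = f"{principal}|{operation}|{topic}"
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and now < expiry:
            return None              # fast path: seen recently, nothing to produce
        # Infrequent path: first sighting, or the entry has expired.
        self._expires_at[key] = now + self._ttl
        return key
```

Because the log topics are compacted on the same key, re-producing a key after the cache entry expires replaces the older record rather than growing the topic without bound.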
  • 33. Leveraging Producer + Consumer Data Key: CN=team-a-app|WRITE|topic_1 Value: { … } Key: CN=team-b-app|READ|topic_1 Value: { … } Key: CN=team-a-app|WRITE|topic_2 Value: { … } _producer_log _consumer_log
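One way to use these keys is to group them by topic, which answers "who produces to and consumes from this topic?" directly. A minimal sketch, assuming keys of the form `<principal>|<operation>|<topic>` shown earlier in the talk; the function name is hypothetical.

```python
from collections import defaultdict


def clients_by_topic(log_keys):
    """Group compaction keys into producers and consumers per topic."""
    usage = defaultdict(lambda: {"producers": set(), "consumers": set()})
    for key in log_keys:
        principal, operation, topic = key.split("|")
        if operation == "WRITE":
            usage[topic]["producers"].add(principal)
        elif operation == "READ":
            usage[topic]["consumers"].add(principal)
    return dict(usage)


keys = [
    "CN=team-a-app|WRITE|topic_1",
    "CN=team-b-app|READ|topic_1",
    "CN=team-a-app|WRITE|topic_2",
]
usage = clients_by_topic(keys)
```

Each principal is an application name, so looking it up in a service registry would map every topic to the teams that use it.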
  • 34. Now we know: 1. Who is connecting to Kafka? 2. What are they doing?
  • 35. Answering the remaining question: 3. How can we have control over it?
  • 36. What are ACLs and why do we care about them? TL;DR ACLs (Access control lists) are a way to restrict access to certain operations for certain users
  • 38. What are ACLs and why do we care about them? 1. Any Kafka client had unrestricted permission to modify Kafka resources. 2. Users could bypass schema management (Shopify/schemas). 3. We had no logs of these operations (more on this later!). While we have a high level of trust internally, this still left our clusters open to unplanned operations and human errors.
  • 39. The “numbers” slide 😅 once again 710 Brokers 14 Clusters 1PB+ Avg. Disk Usage 8M+ Avg. Msgs/Sec 8K+ Active Topics 200+ Kafka Clients
  • 40. Sooo… if something goes wrong, the impact can be HUGE 🔥
  • 41. New platform goals for the Kafka Users Kafka users should only be able to: 1. Consume from and produce to their topics. 🍞🧈 2. Perform basic operations, e.g. resetting consumer groups. Users should absolutely not be able to manage topics themselves!
  • 42. New platform goals for the Kafka Platform The Kafka platform: 1. Better observability/logging when clients perform a restricted operation. 2. Internal apps should still have permission to perform necessary operations. a. Cruise Control can still perform rebalances. b. Only the topic management services can manage topics, etc… 3. Use quotas: We can find the principal using the producer/consumer metrics.
  • 43. What do ACLs look like and how do we manage them? And where even are these ACLs stored?
  • 44. What do ACLs look like and how do we manage them? Not fun to read or parse ^ Don’t even try, save your eyes
  • 45. What do ACLs look like and how do we manage them? Translation: The user/principal kafka-cruise-control (an internal app) is allowed to perform the operations create and alter on the resource topic which matches the pattern __KafkaCruiseControl. (Still not fun to read or parse.) Here the “Principal” is the Common Name (CN) of an SSL certificate. ● All ACLs are applied to certificates. ● This makes matching clients to their certificates important (coming soon!) (Image: the relevant part of an SSL cert.)
  • 46. What do ACLs look like and how do we manage them? Translation: The user/principal kafka-cruise-control is allowed to perform the operations create and alter on the resource topic which matches the pattern … MUCH BETTER… ->
  • 47. What do ACLs look like and how do we manage them? ● ACLs most users would care about. ● The rest are for internal apps that help manage Kafka
  • 48. What do ACLs look like and how do we manage them? We wrote a script to parse existing ACLs and use the YAML file to define them
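The slides do not show the script itself, so the following is only a sketch of what parsing existing ACLs might involve. It assumes each ACL entry is a line shaped like `(principal=..., host=..., operation=..., permissionType=...)`, modeled on what the Kafka ACL command-line tool prints; check this against the output of your Kafka version before relying on it. The function names and the grouped structure are hypothetical, and writing the result out as YAML is left to a YAML library.

```python
import re

# Assumed shape of one ACL entry line; verify against your tool's real output.
ACL_ENTRY = re.compile(
    r"\(principal=(?P<principal>.+?), host=(?P<host>.+?), "
    r"operation=(?P<operation>\w+), permissionType=(?P<permission>\w+)\)"
)


def parse_acl_entry(line: str):
    """Return the fields of one ACL entry as a dict, or None if it does not match."""
    match = ACL_ENTRY.search(line)
    return match.groupdict() if match else None


def group_operations(lines):
    """Collect operations per (principal, permission) pair, ready to dump as YAML."""
    grouped = {}
    for line in lines:
        entry = parse_acl_entry(line)
        if entry is None:
            continue  # headers and blank lines are skipped
        key = (entry["principal"], entry["permission"])
        grouped.setdefault(key, []).append(entry["operation"].lower())
    return grouped
```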
  • 51. Better observability of ACLs What the Splunk logs for this alert look like ^
  • 52. So why do we not just log everything? Why did we extend the authorizer class to collect the same data? Well… we tried :)
  • 55. So why do we not just log everything? Well… we tried :) Service Disruption #1902: Elevated errors on kafka cluster XYZ after deployment ● The dashboard we just saw is only populated for staging environments. ● Streaming ~7 min worth of logs to inspect them produced 600+ MiB… ● We care a lot more about “Deny” operations than approved ones.
  • 56. Some ACL Gotchas ● ACL operations have precedence ○ “Deny” takes precedence over “Allow”. ○ With multiple “Allow” rules, the most permissive rules are used. ○ Sane defaults, but these don’t allow us to have “exclusive” per-resource-level rules: ■ Ex. We can’t have the behavior: All principals can read all topics except for “topic_”, which can only be produced to by principal X ■ Even Kafka ACLs expect rules to be defined per topic level. ○ We can once again modify the Authorizer class to support this. ● If no ACLs have been applied for a resource (ex. topic), creating a single rule will enable ACLs for it. ○ If you want to apply ACLs with no downtime or incidents, consider rules like break glass first
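The two gotchas above can be made concrete with a simplified model of ACL evaluation. It is a sketch, not Kafka's authorizer: it handles only literal topic names, ignores hosts and implied operations, and assumes a broker running with `allow.everyone.if.no.acl.found=true`, which matches the behavior the slide describes (a resource with no ACLs is open to everyone).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Acl:
    principal: str    # e.g. "User:CN=my-app", or "User:*" for everyone
    operation: str    # e.g. "READ", "WRITE", or "ALL"
    topic: str        # literal topic name only, in this simplified model
    permission: str   # "ALLOW" or "DENY"


def is_authorized(acls, principal, operation, topic, allow_if_no_acl=True):
    """Simplified decision: a matching DENY wins, then any matching ALLOW."""
    on_topic = [a for a in acls if a.topic == topic]
    if not on_topic:
        # No ACLs at all on this resource: fall back to the broker default.
        return allow_if_no_acl
    matching = [
        a for a in on_topic
        if a.principal in (principal, "User:*")
        and a.operation in (operation, "ALL")
    ]
    if any(a.permission == "DENY" for a in matching):
        return False
    return any(a.permission == "ALLOW" for a in matching)
```

In this model, adding the first rule to a topic immediately locks out every principal that the rule does not cover, which is why rolling out a broad "break glass" allow rule first can avoid an outage.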
  • 57. Now we have the tools to answer all these questions: 1. Who is connecting to Kafka? 2. What are they doing? 3. How can we have control over it? On-call is no longer dreaded. Our team has more bandwidth to do other work.
  • 58. You can do it too! 🫵 We had K8s namespaces and a service registry tool, what might you have? LDAP, Active Directory, EC2/VM instance names, Environment Variables, Certificate CNs. Get ahead of the problem! Secure your platform and start enforcing topic ownership today! From ACLs to Quotas, Kafka is by default designed for apps to have unique identities!