Secure Kafka at scale in true multi-tenant environment (Vishnu Balusu & Ashok Kadambala, JP Morgan Chase), Kafka Summit SF 2019
The document outlines the design and implementation of a managed Kafka service that addresses various challenges such as security, resiliency, and governance for multi-tenancy in a cloud environment. It covers key components including cluster design, data-driven control plane, and self-service APIs for applications, along with lessons learned and future enhancements in the Kafka ecosystem. Presenters Vishnu Balusu and Ashok Kadambala highlight the importance of automating processes and maintaining a centralized schema registry for effective management.
Introduction to a secure Kafka implementation in a multi-tenant environment, presented at Kafka Summit SF 2019.
Detailed agenda outlining two parts focusing on design principles, cluster design, APIs, resiliency, and future directions.
Presents challenges such as varied designs and lack of firm-wide governance, and proposes a fully managed service built on explicit design principles.
Insights into the Kafka ecosystem ("Kafka-scape"), including infrastructure metrics: 400 apps, 102 clusters, and 1.5 PB configured.
Details on cluster design specifying a 5-node structure with resilience and security configurations including SASL and TLS.
Focus on the data-driven control plane functionality across clusters ensuring efficient management and telemetry.
Explaining logical abstraction for metadata management and automated policy enforcement across tenant namespaces.
Techniques for maintaining application connectivity and health indices in a resilient environment utilizing active-active cluster setups.
Describing the self-service API for creating topics, including the specifics of topic replication and configurations for various scenarios.
Best practices for managing schema registered data while ensuring security and maintaining data lineage.
Showcasing Kafka Streams with defined application flows, highlighting conflict resolution in a multi-tenant setup.
Outline of orchestrating Kafka broker updates ensuring health checks and minimal downtime during patching.
Explaining a common control plane strategy that allows for ubiquitous access across various platforms including public and private clouds.
Summarizing lessons learned during implementation phases, focusing on automation, centralized management, and the complexity of scaling.
Vision for future Kafka enhancements including fleet management, self-healing capabilities, and chaos engineering principles.
Conclusion and gratitude expressed at the end of the presentation.
1.
Kafka as a Managed Service
Secure Kafka at scale in true Multi-Tenant Environment
Kafka Summit, SFO 2019
Presenters: Vishnu Balusu & Ashok Kadambala
2.
Agenda
Part 1
• Motivation & Design Principles
• Kafka-scape
• Cluster Design
• Data-Driven Control Plane
• App Resiliency
Part 2
• Self-Service API
• Schema Management
• Kafka Streams
• Orchestrator (Cluster Patching)
• Ubiquitous Access (Multi-Cloud)
Final Remarks
• Lessons Learned
• Future Ahead
3.
PROBLEM STATEMENT
Why a Managed Service?
Many bespoke implementations across the firm
• Varied design and patterns
• Different standards of security and resiliency
• Lack of firm-wide governance in risk management
• Lack of real end-to-end self-service
• No metadata driven APIs
• No centralized view of Data Lineage
A Fully Managed Service with Design Principles
✓ Centralized Service
✓ Secure from Start
✓ Consumable from Hybrid Cloud and Platforms
✓ Data-Driven End-to-End Self-Service APIs
✓ Scalable on Demand
✓ Built per Customer Requirements
[Graphic: road sign reading "Solution, Next Exit"]
8.
Control Plane: Multi-Tenancy & Capacity Management
• Logical abstraction at the metadata level for every Kafka cluster
• Allows applications to reserve storage on the cluster
• All Kafka artefacts created by an application are maintained within the application namespace
• Topic Sizes and Quotas are enforced (see the sketch after the diagram below)
[Diagram: a physical Kafka cluster logically partitioned into namespaces for Tenant 1, Tenant 2, …, Tenant N; boxes represent topics with size X GB (e.g. 10, 15, 5, 2 GB); each tenant namespace carries its own Metadata, Entitlements, Governance, and Quotas, managed through an automated admin workflow.]
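A minimal sketch of how size enforcement could be wired, assuming the control plane provisions topics through Kafka's AdminClient; the class name TenantTopicProvisioner, the tenant1.orders topic, the broker address, and the size figures are illustrative assumptions, not the presenters' implementation.

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

// Illustrative sketch: create a namespaced topic and cap its size via retention.bytes.
public class TenantTopicProvisioner {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9093");

        try (AdminClient admin = AdminClient.create(props)) {
            long topicSizeGb = 10L;                 // size reserved from the tenant's namespace
            int partitions = 3;
            long retentionBytesPerPartition = topicSizeGb * 1024L * 1024L * 1024L / partitions;

            // Topic name carries the tenant namespace prefix (assumed naming convention).
            NewTopic topic = new NewTopic("tenant1.orders", partitions, (short) 3)
                    .configs(Map.of("retention.bytes", String.valueOf(retentionBytesPerPartition)));

            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}

Produce/consume byte-rate quotas are a separate, broker-side client-quota mechanism and would be applied alongside the per-topic size cap.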
9.
App Resiliency: Connection Profile
• Unique Cluster Names – RREnnnn (Region, Env, Numeric)
• Connection profile is queried via API using cluster name
• Applications are immune to infrastructure changes
{
  "clusterName": "NAD1700",
  "topicSuffix": "na1700",
  "kafkaBrokerConnectionProtocols": [
    {
      "protocol": "SASL_SSL",
      "bootstrapServersStr": "",
      "serviceName": "jpmckafka"
    }
  ],
  "schemaRegistryURLs": [],
  "restProxyURLs": [],
  "clusterReplicationPattern": "ACTIVE_ACTIVE",
  "replicatedClusterProfile": {
    "clusterName": "NAD1701",
    "topicSuffix": "na1701",
    "kafkaBrokerConnectionProtocols": [
      {
        "protocol": "SASL_SSL",
        "bootstrapServersStr": "",
        "serviceName": "jpmckafka"
      }
    ],
    "schemaRegistryURLs": [],
    "restProxyURLs": []
  }
}
GET /applications/{appid}/cluster/{ClusterName}/connectionProfile (Connection Profile for a given cluster; client sketch below)
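A minimal sketch of the client side, assuming the JSON shape above; the host name kafka-control-plane.example.com, the Jackson-based parsing, and the mapping of serviceName to the Kerberos service-name property are illustrative assumptions.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Properties;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Illustrative sketch: fetch a cluster's connection profile and build producer properties from it.
public class ConnectionProfileClient {
    public static Properties producerProps(String appId, String clusterName) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(
                "https://kafka-control-plane.example.com/applications/" + appId
                        + "/cluster/" + clusterName + "/connectionProfile"))
                .GET().build();
        String body = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString()).body();

        JsonNode profile = new ObjectMapper().readTree(body);
        JsonNode broker = profile.get("kafkaBrokerConnectionProtocols").get(0);

        Properties props = new Properties();
        props.put("bootstrap.servers", broker.get("bootstrapServersStr").asText());
        props.put("security.protocol", broker.get("protocol").asText());          // e.g. SASL_SSL
        props.put("sasl.kerberos.service.name", broker.get("serviceName").asText());
        // The application never hard-codes broker addresses, so infrastructure changes
        // only require the control plane's metadata to be updated.
        return props;
    }
}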
10.
App Resiliency: Cluster Health Index
• Health Index is determined from:
  ✓ Ability to produce/consume externally as a client (probe sketch after the diagram below)
  ✓ Number of Kafka/ZooKeeper processes up and running
  ✓ Offline partitions within the cluster
• The Cluster Health Index is persisted as a metric in Prometheus and exposed via an API to application teams
• Recommended to integrate into automated application resiliency
[Diagram: the Control Plane's Health Check API runs periodic health checks on clusters, queries cluster metrics, and determines the Cluster Health Index, which is scraped into Prometheus.]
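A minimal sketch of the produce/consume portion of the check; the canary topic name, the timeouts, and the pre-built client properties (which would need serializers, a unique group.id, and auto.offset.reset=earliest) are assumptions, not the presenters' health checker.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Illustrative sketch: probe a cluster by round-tripping a message through a canary topic.
public class ClusterHealthProbe {
    public static boolean canProduceAndConsume(Properties producerProps, Properties consumerProps) {
        String canaryTopic = "healthcheck.canary";
        String payload = "probe-" + System.currentTimeMillis();
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps);
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {

            producer.send(new ProducerRecord<>(canaryTopic, payload)).get();   // block until acked

            consumer.subscribe(Collections.singleton(canaryTopic));
            long deadline = System.currentTimeMillis() + 10_000;
            while (System.currentTimeMillis() < deadline) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (var record : records) {
                    if (payload.equals(record.value())) {
                        return true;                    // healthy: message round-tripped
                    }
                }
            }
            return false;                               // produced but not consumed in time
        } catch (Exception e) {
            return false;                               // any failure marks the cluster unhealthy
        }
    }
}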
11.
App Resiliency: Active-Active Clusters
• Better utilization of infrastructure
• Require little manual intervention to recover from a datacenter failure
• Eventual Consistency | Highly Available | Partition Tolerance
Multi-DC Resiliency
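A minimal sketch of one way an application could read from both sides, assuming the topicSuffix convention shown in the connection profile above and that replication copies the peer cluster's suffixed topic; the orders topic name is illustrative.

import java.util.Properties;
import java.util.regex.Pattern;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Illustrative sketch: subscribe to both the local and the replicated copy of a logical topic.
public class ActiveActiveConsumer {
    public static KafkaConsumer<String, String> subscribeBothSides(Properties consumerProps) {
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
        // Matches e.g. "orders.na1700" (local) and "orders.na1701" (replicated copy).
        consumer.subscribe(Pattern.compile("orders\\.na170[01]"));
        return consumer;
    }
}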
16.
Schema Management
• GET requests should be open to everyone
• POST/PUT/DELETE requests should be authorized
• Schema registry ownership and lineage should be maintained
Securing Schema Registry
resource.extension.class
Fully qualified class name of a valid implementation of the SchemaRegistryResourceExtension interface. This can be used to inject user-defined resources like filters, typically to add custom capabilities such as logging, security, etc.
17.
Schema Registry: AuthXExtension
@Priority(Priorities.AUTHENTICATION)
public class AuthenticationFilter implements ContainerRequestFilter {

    public AuthenticationFilter() {
    }

    @Override
    public void filter(ContainerRequestContext containerRequestContext) {
        // Authenticate the caller; write requests (POST/PUT/DELETE) additionally
        // require authorization (see the sketch after this slide).
    }
}
resource.extension.class=com.jpmorgan.kafka.schemaregistry.security.SchemaRegistryAuthXExtension
package com.jpmorgan.kafka.schemaregistry.security;

public class SchemaRegistryAuthXExtension implements SchemaRegistryResourceExtension {

    @Override
    public void register(Configurable<?> configurable,
                         SchemaRegistryConfig schemaRegistryConfig,
                         SchemaRegistry schemaRegistry) throws SchemaRegistryException {
        // Register the JAX-RS filter so every Schema Registry REST request passes through it.
        configurable.register(new AuthenticationFilter());
    }

    @Override
    public void close() {
    }
}
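A minimal sketch of how the filter body could enforce the rule from the previous slide (GET open, writes authorized); isAuthorizedToWrite() is a hypothetical placeholder, not the actual entitlement check.

// Illustrative filter body inside AuthenticationFilter (sketch, not JPMC's implementation).
@Override
public void filter(ContainerRequestContext ctx) {
    String method = ctx.getMethod();
    if ("GET".equalsIgnoreCase(method)) {
        return;                                        // reads are open to everyone
    }
    if (!isAuthorizedToWrite(ctx)) {                   // placeholder entitlement check
        ctx.abortWith(javax.ws.rs.core.Response
                .status(javax.ws.rs.core.Response.Status.FORBIDDEN)
                .entity("Schema writes require authorization")
                .build());
    }
}

private boolean isAuthorizedToWrite(ContainerRequestContext ctx) {
    // e.g. validate the caller's credentials against the owning application's entitlements
    return false;
}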
21.
Orchestrator: Cluster Patching
• Find the ActiveController broker and patch it at the end
• For each Kafka broker:
  1. Stop Kafka broker
  2. Deploy config/binaries
  3. Start Kafka broker
  4. Invoke health check
     • Wait for URPs to be zero (sketch after this slide)
     • Produce/consume on a test topic
  5. Abort patching if the health check fails
[Diagram: the Orchestrator drives patching across brokers 1, 2, …, n, using Metadata from the Control Plane and Telemetry for health checks.]
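A minimal sketch of the "wait for URPs to be zero" step, assuming the orchestrator can reach the cluster with an AdminClient; the polling cadence and timeout handling are illustrative assumptions.

import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;

// Illustrative sketch: poll topic descriptions until no partition has fewer ISRs than replicas.
public class PatchHealthCheck {
    public static void waitForZeroUnderReplicatedPartitions(AdminClient admin, long timeoutMs)
            throws Exception {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            Set<String> topics = admin.listTopics().names().get();
            long underReplicated = admin.describeTopics(topics).all().get().values().stream()
                    .flatMap(description -> description.partitions().stream())
                    .filter(p -> p.isr().size() < p.replicas().size())
                    .count();
            if (underReplicated == 0) {
                return;                                // safe to move on to the next broker
            }
            Thread.sleep(5_000);                       // re-check every 5 seconds (arbitrary)
        }
        throw new IllegalStateException("Under-replicated partitions did not clear; aborting patch");
    }
}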
22.
Ubiquitous Access (Multi-Cloud)
• Common Control Plane
• On-Prem Private Cloud: Marketplace Tile
• On-Prem Kube Platform: Service Catalog
• Public Cloud: TLS/OAuth
• OAuth via federated ADFS (KIP-255: OAuth Authentication via SASL/OAUTHBEARER); client config sketch below
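A minimal sketch of the public-cloud client settings implied here (TLS plus SASL/OAUTHBEARER per KIP-255); the AdfsOAuthBearerLoginCallbackHandler class name is a placeholder for a handler that would fetch tokens from the federated ADFS endpoint.

import java.util.Properties;

// Illustrative sketch: Kafka client properties for the public-cloud TLS/OAuth path.
public class PublicCloudClientConfig {
    public static Properties oauthProps(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "OAUTHBEARER");
        // Placeholder: a custom login callback handler that obtains OAuth tokens from ADFS.
        props.put("sasl.login.callback.handler.class",
                "com.example.kafka.oauth.AdfsOAuthBearerLoginCallbackHandler");
        props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.oauthbearer.OAuthBearerLoginModule required;");
        return props;
    }
}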
23.
Lessons Learned
• Automate Everything {large-scale infra}
• Centralized Schema Registry {multiple clusters}
• New Features ≠ Stability
• Offset Management {replicated clusters}: offsets on replicated clusters do not line up (0 1 2 3 4 5 6 7 8 9 ≠ 0 1 2 3 4 5 6 7 8 9)
• Scaling & Monitoring is not an easy job!!
[Slide graphics: Data, API, Tollgates]
24.
Future ahead…
• Fleet Management (State Machines)
• Self-Healing Kafka
• Auto Throttling & Kill Switch
• Centralized Schema Management
• 2.5 DC Stretch Clusters
• Chaos Engineering: Failure is a norm!!!