Kafka as a Managed Service
Secure Kafka at Scale in a True Multi-Tenant Environment
Kafka Summit, SFO 2019
Presenters: Vishnu Balusu & Ashok Kadambala
2
Agenda
Part 1
• Motivation & Design Principles
• Kafkascape
• Cluster Design
• Data-Driven Control Plane
• App Resiliency
Part 2
• Self-Service API
• Schema Management
• Kafka Streams
• Orchestrator (Cluster Patching)
• Ubiquitous Access (Multi-Cloud)
Final Remarks
• Lessons Learned
• Future Ahead
3
PROBLEM STATEMENT
Why a Managed Service?
Many bespoke implementations across the firm
• Varied designs and patterns
• Different standards of security and resiliency
• Lack of firm-wide governance in risk management
• Lack of real end-to-end self-service
• No metadata-driven APIs
• No centralized view of data lineage
A fully managed service with these design principles:
✓ Centralized Service
✓ Secure from Start
✓ Consumable from Hybrid Cloud and Platforms
✓ Data-Driven End-to-End Self-Service APIs
✓ Scalable on Demand
✓ Built per Customer Requirements
Solution: Next Exit
4
Kafkascape
• 400 Apps (100 in production)
• 102 Clusters, 510 Nodes (40 in production)
• 13,000 Topics (1,300 in production)
• 1.5 PB configured storage
• Confluent 5.2.2 (Apache Kafka 2.2.1)
5
Cluster Design
Resiliency
• 5-node clusters
• Replication factor of 4; handles failure of 2 nodes
• Dedicated Zookeeper ensemble per cluster
Security
• SASL & TLS for inter-component connectivity
• Plaintext is disabled
• Default ports are not used
[Diagram: 5-node Kafka cluster; each node runs a Zookeeper, a Replicator, a Kafka Broker, a Schema Registry instance, and an Agent; components communicate over SASL, HTTPS/TLS, and Kerberos]
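For illustration only, these security and resiliency choices translate roughly into broker settings like the ones below. This is a minimal server.properties sketch under assumptions, not the presenters' actual configuration; the hostnames, the non-default port 9521, and the keystore paths are placeholders.

# Illustrative broker settings (placeholder values)
listeners=SASL_SSL://0.0.0.0:9521                 # non-default port, no plaintext listener
advertised.listeners=SASL_SSL://broker1.example.internal:9521
security.inter.broker.protocol=SASL_SSL           # SASL & TLS between components
sasl.enabled.mechanisms=GSSAPI                    # Kerberos
sasl.mechanism.inter.broker.protocol=GSSAPI
ssl.keystore.location=/etc/kafka/ssl/broker.keystore.jks
ssl.truststore.location=/etc/kafka/ssl/broker.truststore.jks
default.replication.factor=4                      # tolerates loss of 2 of 5 nodes
min.insync.replicas=2
zookeeper.connect=zk1.example.internal:2182,zk2.example.internal:2182,zk3.example.internal:2182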
6
Data Driven Control Plane
[Diagram: the control plane exchanges data, admin, and telemetry flows with the Kafka clusters]
7
Control Plane: Functional View
[Diagram: functional view of the control plane and the clusters it manages]
8
Control Plane : Multi-Tenancy & Capacity Management
• Logical abstraction at the metadata level for every Kafka cluster
• Allows applications to reserve storage on the cluster
• All Kafka artefacts created by an application are maintained within the application namespace
• Topic sizes and quotas are enforced
[Diagram: each box labeled X represents a topic of size X GB. Tenants 1..N reserve topics (e.g. 10, 15, 5, 2 GB) inside per-application namespaces, a logical abstraction over the physical Kafka cluster; every namespace carries Metadata, Entitlements, Governance, and Quotas and is managed through an automated admin workflow]
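As an illustration of how such reservations can be enforced with stock tooling, topic size caps and client quotas can be set with kafka-configs.sh. The topic name, principal, and values below are assumptions, not figures from the deck.

# Cap a tenant topic at 10 GB, matching a 10 GB reservation
./kafka-configs.sh --zookeeper zk1:2182 --alter \
  --entity-type topics --entity-name kafka-summit-sfo-na1700 \
  --add-config retention.bytes=10737418240

# Throttle the tenant's principal to 10 MB/s produce and consume
./kafka-configs.sh --zookeeper zk1:2182 --alter \
  --entity-type users --entity-name tenant1-svc \
  --add-config 'producer_byte_rate=10485760,consumer_byte_rate=10485760'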
9
App Resiliency : Connection Profile
• Unique Cluster Names – RREnnnn (Region, Env, Numeric)
• Connection profile is queried via API using cluster name
• Applications are immune to infrastructure changes
{
  "clusterName": "NAD1700",
  "topicSuffix": "na1700",
  "kafkaBrokerConnectionProtocols": [
    {
      "protocol": "SASL_SSL",
      "bootstrapServersStr": "",
      "serviceName": "jpmckafka"
    }
  ],
  "schemaRegistryURLs": [],
  "restProxyURLs": [],
  "clusterReplicationPattern": "ACTIVE_ACTIVE",
  "replicatedClusterProfile": {
    "clusterName": "NAD1701",
    "topicSuffix": "na1701",
    "kafkaBrokerConnectionProtocols": [
      {
        "protocol": "SASL_SSL",
        "bootstrapServersStr": "",
        "serviceName": "jpmckafka"
      }
    ],
    "schemaRegistryURLs": [],
    "restProxyURLs": []
  }
}
GET /applications/{appid}/cluster/{ClusterName}/connectionProfile : Connection profile for a given cluster
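A minimal client-side sketch of this pattern: fetch the connection profile by cluster name and build Kafka client properties from it. The control-plane base URL and the use of Jackson for parsing are assumptions; the endpoint path and field names come from the example above.

// Sketch: resolve broker connectivity from the control plane instead of hard-coding it
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Properties;

public class ConnectionProfileClient {
  public static Properties producerProps(String appId, String clusterName) throws Exception {
    // Base URL is hypothetical; the path is the connectionProfile endpoint shown above
    String url = "https://control-plane.example.internal/applications/" + appId
        + "/cluster/" + clusterName + "/connectionProfile";
    HttpResponse<String> resp = HttpClient.newHttpClient()
        .send(HttpRequest.newBuilder(URI.create(url)).GET().build(),
              HttpResponse.BodyHandlers.ofString());
    JsonNode protocol = new ObjectMapper().readTree(resp.body())
        .at("/kafkaBrokerConnectionProtocols/0");

    Properties props = new Properties();
    props.put("bootstrap.servers", protocol.get("bootstrapServersStr").asText());
    props.put("security.protocol", protocol.get("protocol").asText());           // SASL_SSL
    props.put("sasl.kerberos.service.name", protocol.get("serviceName").asText());
    return props;
  }
}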
10
App Resiliency : Cluster Health Index
• Health Index is determined from:
  ✓ Ability to produce/consume externally as a client
  ✓ Number of Kafka/Zookeeper processes up and running
  ✓ Offline partitions within the cluster
• Cluster Health Index is persisted as a metric in Prometheus and exposed via an API to application teams
• Recommended to integrate into automated application resiliency
[Diagram: the control plane runs periodic health checks on clusters, queries cluster metrics, and determines the cluster health index; the index is scraped by Prometheus and exposed through a Health Check API]
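As an illustration of two of these signals (external produce and offline partitions), a probe along these lines can compute part of such an index with the standard clients; the test topic, scoring, and thresholds are assumptions, not the presenters' implementation.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Map;
import java.util.Properties;

public class ClusterHealthProbe {
  /** Returns a score in [0,1]: 0.5 for a successful external produce, 0.5 for no offline partitions. */
  public static double healthIndex(Properties clientProps, String testTopic) {
    double score = 0.0;
    try (KafkaProducer<String, String> producer =
             new KafkaProducer<>(clientProps, new StringSerializer(), new StringSerializer())) {
      // Signal 1: can we produce externally as a client?
      producer.send(new ProducerRecord<>(testTopic, "probe", "ping")).get();
      score += 0.5;
    } catch (Exception e) {
      // produce failed: leave this half of the score at zero
    }
    try (AdminClient admin = AdminClient.create(clientProps)) {
      // Signal 2: are there offline partitions (no leader) anywhere in the cluster?
      Map<String, TopicDescription> topics =
          admin.describeTopics(admin.listTopics().names().get()).all().get();
      boolean anyOffline = topics.values().stream()
          .flatMap(t -> t.partitions().stream())
          .anyMatch(p -> p.leader() == null || p.leader().isEmpty());
      if (!anyOffline) {
        score += 0.5;
      }
    } catch (Exception e) {
      // metadata unavailable: treat this signal as unhealthy
    }
    return score;
  }
}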
11
App Resiliency : Active-Active Clusters
• Better utilization of infrastructure
• Recovery from a datacenter failure requires little manual intervention
• Eventual Consistency | Highly Available | Partition Tolerance
Multi-DC Resiliency
12
Self-Service API
13
Topic Creation
{
  "topicName": "kafka-summit-sfo",
  "clusterName": "NAD1700",
  "numOfPartitions": 10,
  "compactedTopic": false,
  "topicSizeInGB": 10,
  "retentionInDays": 2,
  "owningApplicationId": 12345,
  "productionPromotable": true
}
[Diagram: application 12345 submits this topic-creation request through the self-service API]
14
Self-Service API : Active-Active Topics
[Diagram: topic kafka-summit-sfo is materialized as kafka-summit-sfo-na1700 and kafka-summit-sfo-na1701 on both clusters NAD1700 and NAD1701; producers write to the locally suffixed topic, replication copies it to the peer cluster, and consumers read both; replication factor 4, min.insync.replicas 2]
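With suffixed topic pairs like this, one common consumption approach (an assumption, not prescribed by the deck) is to subscribe by pattern so a consumer picks up both the locally produced and the replicated topic:

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Properties;
import java.util.regex.Pattern;

public class ActiveActiveConsumer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "nad1700-broker1:9521");   // placeholder bootstrap servers
    props.put("group.id", "kafka-summit-sfo-consumer");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      // Matches kafka-summit-sfo-na1700 and kafka-summit-sfo-na1701 on whichever cluster we point at
      consumer.subscribe(Pattern.compile("kafka-summit-sfo-na17\\d{2}"));
      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        records.forEach(r -> System.out.printf("%s p%d: %s%n", r.topic(), r.partition(), r.value()));
      }
    }
  }
}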
15
Self-Service API : Active-Passive Topics
• Cluster is Active-Active but the topic is Active-Passive (e.g. compacted topics)
• KIP-382: MirrorMaker 2.0
[Diagram: topic kafka-summit-sfo exists on both NAD1700 and NAD1701; the producer and consumer use only the active cluster; replication factor 4, min.insync.replicas 2]
16
Schema Management
Securing Schema Registry
• GET requests should be open to everyone
• POST/PUT/DELETE requests should be authorized
• Schema registry ownership and lineage should be maintained

resource.extension.class
Fully qualified class name of a valid implementation of the SchemaRegistryResourceExtension interface. This can be used to inject user-defined resources like filters. Typically used to add custom capabilities like logging, security, etc.
17
Schema Registry: AuthX Extension
// SchemaRegistryAuthXExtension.java (Confluent 5.x package paths assumed for the imports)
package com.jpmorgan.kafka.schemaregistry.security;

import io.confluent.kafka.schemaregistry.exceptions.SchemaRegistryException;
import io.confluent.kafka.schemaregistry.rest.SchemaRegistryConfig;
import io.confluent.kafka.schemaregistry.rest.extensions.SchemaRegistryResourceExtension;
import io.confluent.kafka.schemaregistry.storage.SchemaRegistry;
import javax.ws.rs.core.Configurable;

public class SchemaRegistryAuthXExtension implements SchemaRegistryResourceExtension {
  @Override
  public void register(Configurable<?> configurable,
                       SchemaRegistryConfig schemaRegistryConfig,
                       SchemaRegistry schemaRegistry) throws SchemaRegistryException {
    // Inject the custom authentication filter into the REST resource chain
    configurable.register(new AuthenticationFilter());
  }

  @Override
  public void close() {
  }
}

// AuthenticationFilter.java (same package); runs before every REST request
import javax.annotation.Priority;
import javax.ws.rs.Priorities;
import javax.ws.rs.container.ContainerRequestContext;
import javax.ws.rs.container.ContainerRequestFilter;

@Priority(Priorities.AUTHENTICATION)
public class AuthenticationFilter implements ContainerRequestFilter {
  public AuthenticationFilter() {
  }

  @Override
  public void filter(ContainerRequestContext containerRequestContext) {
    // Auth logic omitted on the slide: e.g. let GETs through, authenticate/authorize
    // POST/PUT/DELETE, and reject failures via containerRequestContext.abortWith(...)
  }
}

resource.extension.class=com.jpmorgan.kafka.schemaregistry.security.SchemaRegistryAuthXExtension
18
Kafka Streams
{
  "streamApplicationId": "user-transactions-stream",
  "clusterName": "NAD100",
  "streamAuthId": "someuser@REALM.COM",
  "streamThroughputInKBPS": 1000,
  "owningApplicationId": 1234,
  "streamUserTopics": {
    "inputTopics": [
      "user-transactions"
    ],
    "intermediateTopics": [],
    "outputTopics": [
      "patterns",
      "rewards",
      "purchases"
    ]
  }
}
Example Use Case
[Diagram: example use case onboarded via the Stream API; a Masking stream reads user-transactions and feeds the Rewards and Patterns services, producing the rewards, patterns, and purchases topics]
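A minimal Kafka Streams sketch of such a topology, assuming the records are plain strings; the mask() helper and the routing into purchases, rewards, and patterns are purely illustrative placeholders for the real business logic.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import java.util.Properties;

public class UserTransactionsStream {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "user-transactions-stream"); // matches onboarding payload
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "nad100-broker1:9521");   // placeholder
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> transactions = builder.stream("user-transactions");

    // Mask PII before anything leaves the input topic
    KStream<String, String> masked = transactions.mapValues(UserTransactionsStream::mask);

    masked.to("purchases");                                        // masked purchase events
    masked.filter((k, v) -> v.contains("REWARD")).to("rewards");   // illustrative routing
    masked.filter((k, v) -> v.contains("PATTERN")).to("patterns");

    new KafkaStreams(builder.build(), props).start();
  }

  private static String mask(String txn) {
    // Placeholder: redact card numbers and other PII
    return txn.replaceAll("\\d{12,19}", "****");
  }
}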
19
Stream Application Id Conflicts
Stream Application Id conflicts must be handled in a multi-tenant environment to avoid unintended consequences.
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "user-transactions-stream");
//using CLI
./kafka-acls.sh --authorizer-properties zookeeper.connect=server:port \
  --add --allow-principal User:a_user \
  --resource-pattern-type prefixed \
  --topic user-transactions-stream \
  --group user-transactions-stream \
  --transactional-id user-transactions-stream \
  --operation All

OR

//using Admin Client
CreateAclsOptions createAclsOptions = new CreateAclsOptions();
....
.... PatternType.PREFIXED) ....
adminClient.createAcls(aclBindings, createAclsOptions).all().get(60, TimeUnit.SECONDS);
[Diagram: Stream Application Id user-transactions-stream shown alongside the topics user-transactions and user-transactions-stream-audit]
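For reference, a fleshed-out version of the elided Admin Client call might look like the sketch below; the principal and the 60-second timeout mirror the slide, and the rest (method and class names, use of prefixed ACLs on topic, group, and transactional id) is a reasonable assumption rather than the presenters' exact code.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.CreateAclsOptions;
import org.apache.kafka.common.acl.AccessControlEntry;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.TimeUnit;

public class StreamAclProvisioner {
  public static void grantPrefixedAcls(Properties adminProps, String applicationId, String principal)
      throws Exception {
    AccessControlEntry allowAll =
        new AccessControlEntry(principal, "*", AclOperation.ALL, AclPermissionType.ALLOW);
    List<AclBinding> aclBindings = new ArrayList<>();
    // Prefixed ACLs so the stream's internal topics, consumer group, and transactional ids are all covered
    for (ResourceType type : Arrays.asList(ResourceType.TOPIC, ResourceType.GROUP, ResourceType.TRANSACTIONAL_ID)) {
      aclBindings.add(new AclBinding(new ResourcePattern(type, applicationId, PatternType.PREFIXED), allowAll));
    }
    try (AdminClient adminClient = AdminClient.create(adminProps)) {
      adminClient.createAcls(aclBindings, new CreateAclsOptions()).all().get(60, TimeUnit.SECONDS);
    }
  }
}
// e.g. grantPrefixedAcls(adminProps, "user-transactions-stream", "User:a_user");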
20
Orchestrator: Cluster Patching
{
  "deployKeytabs": false,
  "componentsInScope": [
    {
      "component": "KAFKA",
      "deployConfig": true,
      "deployBinaries": true,
      "binariesVersion": "Confluent-5.2.2"
    }
  ],
  "goodToGoEvidence": {
    "evidenceType": "NOT_APPLICABLE",
    "evidenceId": "string"
  }
}
[Diagram: the Orchestrator drives patching of clusters 1, 2, ..., n using metadata from the Control Plane and telemetry]
21
Orchestrator: Cluster Patching
• Find the Active Controller broker and patch it last
• For each Kafka broker (a sketch of this loop follows below):
  1. Stop the Kafka broker
  2. Deploy config/binaries
  3. Start the Kafka broker
  4. Invoke health check
     • Wait for under-replicated partitions (URPs) to reach zero
     • Produce/consume on a test topic
  5. Abort patching if the health check fails
[Diagram: Orchestrator, Control Plane, Telemetry, and clusters 1, 2, ..., n, as on the previous slide]
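A rough sketch of that rolling-patch loop, assuming stopBroker/deployTo/startBroker and produceConsumeProbe are placeholders for the deployment tooling and the test-topic round trip; only the controller discovery and URP check use the real Admin Client API.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.common.Node;
import java.util.ArrayList;
import java.util.List;

public class ClusterPatchRun {
  public static void patch(AdminClient admin) throws Exception {
    Node controller = admin.describeCluster().controller().get();
    List<Node> brokers = new ArrayList<>(admin.describeCluster().nodes().get());
    // Order brokers so the active controller is patched last
    brokers.sort((a, b) -> Boolean.compare(a.id() == controller.id(), b.id() == controller.id()));

    for (Node broker : brokers) {
      stopBroker(broker);     // placeholder: stop the Kafka process on this node
      deployTo(broker);       // placeholder: push config/binaries
      startBroker(broker);    // placeholder: start the Kafka process
      // Health check: wait for URPs to drain, then do a produce/consume round trip
      while (underReplicatedPartitions(admin) > 0) {
        Thread.sleep(5_000);
      }
      if (!produceConsumeProbe()) {
        throw new IllegalStateException("Aborting patch: health check failed on broker " + broker.id());
      }
    }
  }

  static long underReplicatedPartitions(AdminClient admin) throws Exception {
    return admin.describeTopics(admin.listTopics().names().get()).all().get().values().stream()
        .flatMap(t -> t.partitions().stream())
        .filter(p -> p.isr().size() < p.replicas().size())
        .count();
  }

  // Placeholders standing in for deployment tooling and the test-topic probe (assumptions)
  static void stopBroker(Node broker) { }
  static void deployTo(Node broker) { }
  static void startBroker(Node broker) { }
  static boolean produceConsumeProbe() { return true; }
}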
22
Ubiquitous Access (Multi-Cloud)
• Common Control Plane
• OnPrem Private Cloud: Marketplace Tile
• OnPrem Kube Platform: Service Catalog
• Public Cloud: TLS/OAuth
• OAuth via Federated ADFS (KIP-255: OAuth Authentication via SASL/OAUTHBEARER); client configuration sketched below
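For the public-cloud path, the client side of KIP-255 boils down to properties along these lines. The callback handler class name and file paths are placeholders; in practice a custom handler fetches tokens from the federated ADFS endpoint.

# Illustrative client properties for SASL/OAUTHBEARER over TLS (placeholder values)
security.protocol=SASL_SSL
sasl.mechanism=OAUTHBEARER
sasl.jaas.config=org.apache.kafka.common.security.oauthbearer.OAuthBearerLoginModule required;
# The default unsecured handler is not production-safe; a custom handler obtains tokens from ADFS
sasl.login.callback.handler.class=com.example.kafka.auth.AdfsOAuthBearerCallbackHandler
ssl.truststore.location=/etc/kafka/ssl/client.truststore.jks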
23
Lessons Learned
• Tollgates {data, API}
• Automate everything {large-scale infra}
• Centralized Schema Registry {multiple clusters}
• New features ≠ stability
• Offset management {replicated clusters}
• Scaling & monitoring is not an easy job!!
24
Future ahead…
• Fleet Management (State Machines)
• Self-Healing Kafka
• Auto Throttling & Kill Switch
• Centralized Schema Management
• 2.5 DC Stretch Clusters
• Chaos Engineering (Failure is a norm!!)
Thank You