The top 3 challenges running multi-tenant Flink at scale

Sharon Xie (sharon@decodable.co)
Founding Engineer at Decodable

Quick Intro of Decodable
Mental Model
● Connections to external
data systems
● Streams of data records
● Pipelines that process
data in streams

Flink at Decodable
● Runs the connection and pipeline jobs
● Not directly exposed to the users
● Mixed deployment modes for different use
cases
Why Flink?
● Purposely designed for stream processing
● Proven to scale in production
● Mature community

Cool, now the challenges start
The paradox of choice
● Massive Flink configurations
● Different APIs, deployment modes
Cloud resource sharing is great but …
● Noisy neighbors
● Blast radius
● Security
● Observability

Challenge 1: Infra resource management
Problems
● Isolation vs. resource sharing
● Cost 💸
● Developer productivity
Design principles
● Use managed services
● Start with max cost efficiency
● Easy to support max isolation

Start with max cost efficiency
[Diagram: cells 1, 2, ... n share one set of infrastructure]
● VPC (subnets, SGs, ...)
● K8S Cluster (EKS: cluster + nodes + ...)
○ Cell 1 K8s namespaces: decodable-cell-1-control, decodable-cell-1-data
○ Cell 2 K8s namespaces: decodable-cell-2-control, decodable-cell-2-data
● DB Cluster (RDS Aurora cluster + R/W instances)
● Kafka Cluster (MSK, AWS Managed Kafka), with topics for each cell

For max isolation deployment
[Diagram: a single cell with its own VPC, K8S Cluster, DB Cluster, and Kafka Cluster]

Managed by Terraform
module cell_1 {
  cluster  = module.clusters.cluster_1
  database = module.databases.database_1
  kafka    = module.kafkas.kafka_1
}
module cluster_1 { network = module.vpcs.1 }
module database_1 { network = module.vpcs.1 }
module kafka_1 { network = module.vpcs.1 }
● Centralized infrastructure management
● Easy to provision a new cell with reusable infra code

Challenge 2: Observability
Information Overload
● Each Flink job has a lot of metrics
○ Operator level, task level, job level, JobManager,
TaskManagers
○ Now imagine we have a lot of jobs…
● Flink reports a lot of errors; they are useful for debugging
but need classification
○ Connectivity issues with external systems
○ Internal temporary errors
○ Internal errors that require someone to fix them (e.g. code
errors)

Internal Observability
Goals
● Optimize for on-call engineers' QoL
● Actionable operational insights
What we did:
● Reduce the likelihood of configuration error
○ Flink pods have the same configuration
○ Run Flink SQL jobs
● Key metrics / events to monitor
○ Successful completion of checkpoints
○ Time from activation until all tasks are running
○ Job failures / restarts
● Heavy use of Kubernetes tagging
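The three key signals above (checkpoint completion, time from activation until all tasks are running, and failures / restarts) can be turned into a small alerting check. The following is a minimal sketch only, not Decodable's implementation: the `JobSnapshot` fields, the alert names, and all thresholds are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JobSnapshot:
    """Hypothetical per-job snapshot; field names are illustrative."""
    job_id: str
    now: float                                      # current time, epoch seconds
    activated_at: float                             # when activation was requested
    all_tasks_running_at: Optional[float]           # None while the job is still starting
    last_checkpoint_completed_at: Optional[float]   # None if no checkpoint completed yet
    restarts_last_hour: int


def evaluate(
    snap: JobSnapshot,
    max_startup_s: float = 300.0,         # assumed threshold
    max_checkpoint_gap_s: float = 600.0,  # assumed threshold
    max_restarts: int = 3,                # assumed threshold
) -> List[str]:
    """Return the names of the alerts that fire for this snapshot."""
    alerts: List[str] = []

    # Job failures / restarts
    if snap.restarts_last_hour > max_restarts:
        alerts.append("restart_loop")

    # Time from activation until all tasks are running
    if snap.all_tasks_running_at is None:
        if snap.now - snap.activated_at > max_startup_s:
            alerts.append("slow_startup")
        return alerts  # checkpoints are not expected before the job runs

    # Successful completion of checkpoints: measure the gap since the last
    # completed checkpoint, or since the job started running if there is none.
    baseline = snap.last_checkpoint_completed_at
    if baseline is None:
        baseline = snap.all_tasks_running_at
    if snap.now - baseline > max_checkpoint_gap_s:
        alerts.append("checkpoint_stalled")

    return alerts
```

Keeping the output to a short list of named alerts matches the stated goal of actionable insights for on-call engineers, rather than surfacing every operator-level metric.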

User-facing Observability
Goals:
● Eliminate noise while knowing what’s going on
● Actionable error messages
● Integrate with users’ monitoring tools
What we did
● Expose job-level metrics
● Expose runtime errors that require user action
○ Error message classifier / processor
○ Custom metrics reporter to Kafka
● Streams for audit events & job metrics
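An error message classifier could follow the three categories listed earlier under Information Overload: connectivity issues with external systems, internal temporary errors, and internal errors that need a fix. The sketch below is an assumption-laden illustration, not the classifier described in the talk: the regex patterns, the enum names, and the rule that only connectivity errors are shown to users are all hypothetical.

```python
import re
from enum import Enum


class ErrorClass(Enum):
    EXTERNAL_CONNECTIVITY = "external_connectivity"
    TRANSIENT_INTERNAL = "transient_internal"
    NEEDS_FIX = "needs_fix"


# Hypothetical patterns for illustration only. Order matters: internal
# rules are checked first so that an internal heartbeat timeout is not
# mistaken for an external connectivity problem.
_RULES = [
    (re.compile(r"heartbeat of taskmanager|checkpoint expired", re.I),
     ErrorClass.TRANSIENT_INTERNAL),
    (re.compile(r"connection refused|unknownhostexception|timed out|authentication failed", re.I),
     ErrorClass.EXTERNAL_CONNECTIVITY),
]


def classify(message: str) -> ErrorClass:
    """Return the first matching class; unmatched errors need a human."""
    for pattern, error_class in _RULES:
        if pattern.search(message):
            return error_class
    return ErrorClass.NEEDS_FIX


def is_user_actionable(error_class: ErrorClass) -> bool:
    """Assumption: only external connectivity errors are exposed to users."""
    return error_class is ErrorClass.EXTERNAL_CONNECTIVITY
```

Defaulting unmatched messages to `NEEDS_FIX` is a deliberate choice in this sketch: an unknown error reaches an engineer instead of being silently treated as transient.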

Challenge 3: Authentication with external
systems
This challenge is REALLY about: managing sensitive info in cloud services
🔥Disclaimer🔥
● Lots of omitted details
● Focus on the rationale
● Don’t just copy
Principles
● Principle of least privilege
● Minimize the # of services with access to the sensitive info
● Audit everything

Handling Sensitive info
● Rely on cloud provider role-based auth wherever possible
● Secrets are stored in AWS Secrets Manager
● APIs can only create secrets (can’t read)
● Secrets are only retrieved at activation time
● Flink pods don’t have access to the Secrets Manager
○ Secrets are created as k8s secrets and mounted to the pod running the job
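The last step, handing a secret to a job as a Kubernetes secret mounted into its pod, can be sketched as follows. This is an illustration under stated assumptions, not the talk's implementation: the function names, the secret name, and the mount directory are hypothetical, and the retrieval from the Secrets Manager at activation time is deliberately left out. The manifest shape (`apiVersion: v1`, `kind: Secret`, base64-encoded `data`) is the standard Kubernetes Secret format.

```python
import base64
from pathlib import Path
from typing import Dict


def render_k8s_secret(name: str, namespace: str, values: Dict[str, str]) -> dict:
    """Build a Kubernetes Secret manifest from plaintext values.

    Meant to run only in the single service that may read the secret
    store, at activation time; the job pod never calls this.
    """
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        # Kubernetes expects base64-encoded values under `data`.
        "data": {
            key: base64.b64encode(value.encode("utf-8")).decode("ascii")
            for key, value in values.items()
        },
    }


def read_mounted_secret(mount_dir: str, key: str) -> str:
    """Inside the job pod: each key of a mounted Secret appears as a file."""
    return (Path(mount_dir) / key).read_text(encoding="utf-8")
```

With this split, the job only ever sees files under its own mount path, which is consistent with Flink pods having no access to the Secrets Manager.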

Rationale
● Storing sensitive information together with other configurations (RDS) is bad
○ Many services have access to RDS
○ Hard to enforce more granular access control
○ Increases the risk profile of the RDS
● Only 1 service has access to the Secrets Manager

Parting Thoughts
To run Flink in a multi-tenant environment, you need deep knowledge about:
● Cloud infra/architecture
● Networking
● Operating Systems
● Distributed Systems
● …
It’s just hard!

2022
Build real-time data apps &
services. Fast.
decodable.co
