Apache Flink is the foundation for Decodable's real-time SaaS data platform. Flink runs critical data processing jobs with strong security requirements. In addition, Decodable has to scale to thousands of tenants, power various use cases, provide an intuitive user experience, and maintain cost-efficiency. We've learned a lot of lessons while building and maintaining the platform. In this talk, I'll share the three toughest challenges we faced building and operating this platform with Flink, and how we solved them.
The top 3 challenges running multi-tenant Flink at scale
1. The top 3 challenges running multi-tenant Flink at scale
Sharon Xie (sharon@decodable.co)
Founding Engineer at Decodable
2. Quick Intro of Decodable
Mental Model
● Connections to external data systems
● Streams of data records
● Pipelines that process data in streams
3. Flink at Decodable
● Runs the connection and pipeline jobs
● Not directly exposed to the users
● Mixed deployment modes for different use cases
Why Flink?
● Purposely designed for stream processing
● Proven to scale in production
● Mature community
4. Cool, now the challenges start
The paradox of choice
● Massive Flink configurations
● Different APIs, deployment modes
Cloud resource sharing is great but …
● Noisy neighbors
● Blast radius
● Security
● Observability
5. Challenge 1: Infra resource management
Problems
● Isolation vs. resource sharing
● Cost 💸
● Developer productivity
Design principles
● Use managed services
● Start with max cost efficiency
● Easy to support max isolation
7. For max isolation deployment
[Diagram: a cell, consisting of a VPC that contains a DB cluster, a Kafka cluster, and a K8s cluster]
8. Managed by Terraform
module "cell_1" {
  cluster  = module.clusters.cluster_1
  database = module.databases.database_1
  kafka    = module.kafkas.kafka_1
}
module "cluster_1" { network = module.vpcs.1 }
module "database_1" { network = module.vpcs.1 }
module "kafka_1" { network = module.vpcs.1 }
● Centralized infrastructure management
● Easy to provision a new cell with reusable infra code
9. Challenge 2: Observability
Information Overload
● Each Flink job has a lot of metrics
○ Operator level, task level, job level, JobManager, TaskManagers
○ Now imagine we have a lot of jobs…
● Flink reports a lot of errors, which are useful for debugging but need classification
○ Connectivity issues with external systems
○ Internal temporary errors
○ Internal errors that require someone to fix (e.g., code errors)
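The three error classes above could be separated with a small rule-based classifier. The following is only a minimal sketch: the class names, the regex patterns, and the "first match wins, unknown errors escalate" policy are assumptions for illustration, not Decodable's actual classifier.

```python
import re
from enum import Enum


class ErrorClass(Enum):
    EXTERNAL_CONNECTIVITY = "external_connectivity"
    INTERNAL_TRANSIENT = "internal_transient"
    INTERNAL_NEEDS_FIX = "internal_needs_fix"


# Hypothetical patterns; a real rule set would be curated from observed errors.
# Rules are checked in order and the first match wins.
RULES = [
    (re.compile(r"connection refused|unknownhostexception|authentication failed", re.I),
     ErrorClass.EXTERNAL_CONNECTIVITY),
    (re.compile(r"heartbeat of taskmanager|checkpoint expired", re.I),
     ErrorClass.INTERNAL_TRANSIENT),
]


def classify(message: str) -> ErrorClass:
    """Map a raw error message to one of the three classes."""
    for pattern, error_class in RULES:
        if pattern.search(message):
            return error_class
    # Anything unrecognized is treated as needing a human to look at it.
    return ErrorClass.INTERNAL_NEEDS_FIX
```

Defaulting unknown errors to the "needs someone to fix" class is a deliberate choice in this sketch: it is noisier, but it avoids silently hiding a new failure mode.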
10. Internal Observability
Goals
● Optimize for on-call engineers' QoL
● Actionable operational insights
What we did:
● Reduce the likelihood of configuration error
○ Flink pods have the same configuration
○ Run Flink SQL jobs
● Key metrics / events to monitor
○ Successful completion of checkpoints
○ Time from activation until all tasks are running
○ Job failures / restarts
● Heavy use of Kubernetes tagging
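As an illustration of how the three key signals above (checkpoint completion, time from activation until all tasks are running, and restarts) might be turned into actionable alerts, here is a minimal sketch. The `JobSnapshot` fields, the alert names, and all thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobSnapshot:
    seconds_since_last_checkpoint: float
    seconds_since_activation: float
    all_tasks_running: bool
    restarts_last_hour: int


def alerts(snapshot: JobSnapshot,
           checkpoint_limit_s: float = 600.0,   # assumed threshold
           startup_limit_s: float = 300.0,      # assumed threshold
           restart_limit: int = 3) -> List[str]:  # assumed threshold
    """Return the names of the alerts that fire for one job snapshot."""
    fired = []
    # A running job that has not completed a checkpoint recently.
    if snapshot.all_tasks_running and snapshot.seconds_since_last_checkpoint > checkpoint_limit_s:
        fired.append("checkpoint_stalled")
    # An activated job whose tasks are still not all running.
    if not snapshot.all_tasks_running and snapshot.seconds_since_activation > startup_limit_s:
        fired.append("slow_startup")
    # A job that keeps failing and restarting.
    if snapshot.restarts_last_hour > restart_limit:
        fired.append("restart_loop")
    return fired
```

Keeping the rule set this small is the point: each alert maps to one thing an on-call engineer can act on.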
11. User-facing Observability
Goals:
● Eliminate noise while knowing what’s going on
● Actionable error messages
● Integrate with users’ monitoring tools
What we did
● Expose job-level metrics
● Expose runtime errors that require user action
○ Error message classifier / processor
○ Custom metrics reporter to Kafka
● Streams for audit events & job metrics
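One way to picture "expose job-level metrics" is an allowlist filter applied before metrics are written to a user-visible stream. This is a sketch under assumptions: the allowlist contents and the record shape are illustrative, and the step that actually sends the record to Kafka is omitted.

```python
import json

# Illustrative allowlist of job-level metric names to expose to users.
JOB_LEVEL_METRICS = {
    "numRestarts",
    "lastCheckpointDuration",
    "numberOfCompletedCheckpoints",
}


def to_stream_record(job_id: str, raw_metrics: dict) -> str:
    """Keep only allowlisted job-level metrics and serialize them as JSON."""
    exposed = {name: value for name, value in raw_metrics.items()
               if name in JOB_LEVEL_METRICS}
    return json.dumps({"job_id": job_id, "metrics": exposed}, sort_keys=True)
```

Operator- and task-level metrics are dropped here rather than aggregated, which keeps the user-facing stream small and avoids exposing internals.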
12. Challenge 3: Authentication with external systems
This challenge is REALLY about: managing sensitive info in cloud services
🔥Disclaimer🔥
● Lots of omitted details
● Focus on the rationale
● Don’t just copy
Principles
● Principle of least privilege
● Minimize the # of services with access to the sensitive info
● Audit everything
13. Handling Sensitive info
● Rely on cloud provider role-based auth wherever possible
● Secrets are stored in AWS Secrets Manager
● APIs can only create secrets (can’t read)
● Secrets are only retrieved at activation time
● Flink pods don’t have access to the Secrets Manager
○ Secrets are created as k8s secrets and mounted to the pod running the job
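The flow above can be sketched as follows. This is not the AWS Secrets Manager API: `SecretStore` is an in-memory stand-in, and the role names (`"api"`, `"activation"`) and helper names are hypothetical. The only external fact relied on is that a Kubernetes Secret manifest carries base64-encoded values under `data`.

```python
import base64


class SecretStore:
    """In-memory stand-in for a secrets manager, with role-based access."""

    def __init__(self):
        self._values = {}

    def create(self, name: str, value: str, caller: str) -> None:
        # Assumed policy: the API path may create secrets but never read them.
        if caller not in ("api", "activation"):
            raise PermissionError(f"{caller} may not create secrets")
        self._values[name] = value

    def read(self, name: str, caller: str) -> str:
        # Assumed policy: only the activation step may read, at activation time.
        if caller != "activation":
            raise PermissionError(f"{caller} may not read secrets")
        return self._values[name]


def k8s_secret_manifest(name: str, namespace: str, secrets: dict) -> dict:
    """Build a Kubernetes Secret manifest to be mounted into the job's pod."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {key: base64.b64encode(value.encode()).decode()
                 for key, value in secrets.items()},
    }
```

In this sketch the Flink pod never talks to the store; it only sees the mounted Kubernetes Secret, which matches the intent of keeping the number of services with access to sensitive info small.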
14. Rationale
● Storing sensitive information together with other configurations (RDS) is bad
○ Many services have access to RDS
○ Hard to enforce more granular access control
○ Increases the risk profile of the RDS
● Only 1 service has access to the Secrets Manager
15. Parting Thoughts
To run Flink in a multi-tenant environment, you need deep knowledge about:
● Cloud infra/architecture
● Networking
● Operating Systems
● Distributed Systems
● …
It’s just hard!
What else?
● User management / access control
● Multi-tenant resource management
No UDF
SQL operator’s behavior is known and predictable
Stress that we know how these operators behave on most workloads, to make it clear we're not just YOLO'ing our config.
API service has the most privilege as it can R/W secrets
Still trapped in the cell when 💩happens
Heavily audit activities in the api service
Easily remove the service account’s assumed role when bad things happen