The top 3 challenges running multi-tenant Flink at scale
Sharon Xie (sharon@decodable.co)
Founding Engineer at Decodable
Quick Intro of Decodable
Mental Model
● Connections to external data systems
● Streams of data records
● Pipelines that process data in streams
Flink at Decodable
● Runs the connection and pipeline jobs
● Not directly exposed to the users
● Mixed deployment modes for different use cases
Why Flink?
● Purpose-built for stream processing
● Proven to scale in production
● Mature community
Cool, now the challenges start
The paradox of choice
● Massive Flink configurations
● Different APIs, deployment modes
Cloud resource sharing is great but …
● Noisy neighbors
● Blast radius
● Security
● Observability
Challenge 1: Infra resource management
Problems
● Isolation vs. resource sharing
● Cost 💸
● Developer productivity
Design principles
● Use managed services
● Start with max cost efficiency
● Easy to support max isolation
Start with max cost efficiency
● [Diagram] Cells 1 and 2 run on one shared set of managed infrastructure
○ VPC: subnets, SGs, ...
○ K8s cluster (EKS): cluster + nodes + ...
○ DB cluster: RDS Aurora cluster + R/W instances
○ Kafka cluster: MSK (AWS Managed Kafka), with topics per cell
● K8s namespaces per cell
○ Cell 1: decodable-cell-1-control, decodable-cell-1-data
○ Cell 2: decodable-cell-2-control, decodable-cell-2-data
For max isolation deployment
● [Diagram] A single cell with its own VPC, K8s cluster, DB cluster, and Kafka cluster
Managed by Terraform
# Simplified: required arguments such as "source" are omitted
module "cell_1" {
  cluster  = module.clusters.cluster_1
  database = module.databases.database_1
  kafka    = module.kafkas.kafka_1
}
module "cluster_1"  { network = module.vpcs.1 }
module "database_1" { network = module.vpcs.1 }
module "kafka_1"    { network = module.vpcs.1 }
● Centralized infrastructure management
● Easy to provision a new cell with reusable infra code
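The cell idea can be sketched in a few lines of Python. This is an illustrative model only, not Decodable's code: the `Infra` and `Cell` classes, the `shares_infra` helper, and the resource names are hypothetical. Only the namespace pattern (`decodable-cell-<id>-control` / `-data`) comes from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Infra:
    """One set of infrastructure building blocks (names are made up)."""
    vpc: str
    k8s_cluster: str
    db_cluster: str
    kafka_cluster: str


@dataclass(frozen=True)
class Cell:
    cell_id: int
    infra: Infra

    def namespaces(self) -> list[str]:
        # Pattern from the slides: decodable-cell-<id>-control / -data
        return [f"decodable-cell-{self.cell_id}-{plane}" for plane in ("control", "data")]


def shares_infra(a: Cell, b: Cell) -> bool:
    """True when two cells run on the same underlying infrastructure."""
    return a.infra == b.infra


# Cost-efficient mode: cells 1 and 2 share one VPC, K8s, DB, and Kafka cluster.
shared = Infra("vpc-1", "eks-1", "aurora-1", "msk-1")
cell_1 = Cell(1, shared)
cell_2 = Cell(2, shared)

# Max-isolation mode: a cell gets its own copy of everything.
cell_3 = Cell(3, Infra("vpc-3", "eks-3", "aurora-3", "msk-3"))
```

Moving a tenant from shared to isolated infrastructure then amounts to pointing its cell at a different `Infra`, which mirrors how the Terraform modules wire a cell to a cluster, database, and Kafka.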
Challenge 2: Observability
Information Overload
● Each Flink job has a lot of metrics
○ Operator level, task level, job level, JobManager, TaskManagers
○ Now imagine we have a lot of jobs…
● Flink reports a lot of errors; they are useful for debugging but need classification
○ Connectivity issues with external systems
○ Internal temporary errors
○ Internal errors that require someone to fix (e.g., code errors)
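A minimal sketch of the three-way classification above, assuming simple pattern matching on the error message. The patterns and sample messages are hypothetical and chosen for illustration; the talk does not say how errors are actually matched.

```python
import re
from enum import Enum


class ErrorClass(Enum):
    EXTERNAL_CONNECTIVITY = "connectivity issue with an external system"
    INTERNAL_TRANSIENT = "internal temporary error"
    INTERNAL_NEEDS_FIX = "internal error that someone has to fix"


# Hypothetical rules; a real rule set would be built from errors seen in production.
RULES = [
    (re.compile(r"connection refused|timed? ?out|unknown host", re.I),
     ErrorClass.EXTERNAL_CONNECTIVITY),
    (re.compile(r"checkpoint expired|heartbeat", re.I),
     ErrorClass.INTERNAL_TRANSIENT),
]


def classify(message: str) -> ErrorClass:
    """Return the class of the first matching rule; unmatched errors are escalated."""
    for pattern, error_class in RULES:
        if pattern.search(message):
            return error_class
    return ErrorClass.INTERNAL_NEEDS_FIX
```

Defaulting to `INTERNAL_NEEDS_FIX` is a deliberate choice in this sketch: an unknown error is surfaced to an engineer rather than silently treated as transient.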
Internal Observability
Goals
● Optimize for on-call engineers' quality of life (QoL)
● Actionable operational insights
What we did:
● Reduce the likelihood of configuration error
○ Flink pods have the same configuration
○ Run Flink SQL jobs
● Key metrics / events to monitor
○ Successful completion of checkpoints
○ Time from activation until all tasks are running
○ Job failures / restarts
● Heavy use of Kubernetes tagging
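The three key signals above could be turned into alerts roughly as follows. This is a sketch under assumptions: the `JobSnapshot` fields and all thresholds are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass
class JobSnapshot:
    """Hypothetical per-job summary collected by a monitoring system."""
    job_id: str
    seconds_since_last_checkpoint: float
    seconds_to_all_tasks_running: float
    restarts_last_hour: int


# Illustrative thresholds, not values from the talk.
MAX_CHECKPOINT_AGE_S = 600
MAX_STARTUP_S = 300
MAX_RESTARTS_PER_HOUR = 3


def alerts(job: JobSnapshot) -> list[str]:
    """Return one alert message per violated signal."""
    found = []
    if job.seconds_since_last_checkpoint > MAX_CHECKPOINT_AGE_S:
        found.append(
            f"{job.job_id}: no successful checkpoint for {job.seconds_since_last_checkpoint:.0f}s"
        )
    if job.seconds_to_all_tasks_running > MAX_STARTUP_S:
        found.append(
            f"{job.job_id}: took {job.seconds_to_all_tasks_running:.0f}s to reach all-tasks-running"
        )
    if job.restarts_last_hour > MAX_RESTARTS_PER_HOUR:
        found.append(f"{job.job_id}: {job.restarts_last_hour} restarts in the last hour")
    return found
```

Keeping the list of signals this short is the point: a handful of checks per job stays readable for whoever is on call, even with many jobs.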
User-facing Observability
Goals:
● Eliminate noise while knowing what’s going on
● Actionable error messages
● Integrate with users’ monitoring tools
What we did
● Expose job-level metrics
● Expose runtime errors that require user action
○ Error message classifier / processor
○ Custom metrics reporter to Kafka
● Streams for audit events & job metrics
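As a rough illustration of "job-level metrics only", the sketch below filters raw metric records down to the job scope and serializes each one as a JSON record that could be written to a metrics stream. The record shape (`scope`, `job_id`, `name`, `value`) is an assumption made for this example, and publishing to Kafka is left out.

```python
import json


def user_facing_records(raw_metrics: list[dict]) -> list[str]:
    """Keep job-level metrics only and serialize each as one JSON record."""
    records = []
    for metric in raw_metrics:
        if metric["scope"] != "job":
            # Operator, task, JobManager, and TaskManager detail stays internal.
            continue
        records.append(
            json.dumps(
                {"job_id": metric["job_id"], "name": metric["name"], "value": metric["value"]},
                sort_keys=True,
            )
        )
    return records


# Hypothetical input: one job-level metric and one task-level metric.
sample = [
    {"scope": "job", "job_id": "j1", "name": "records_out", "value": 42},
    {"scope": "task", "job_id": "j1", "name": "buffer_usage", "value": 0.7},
]
records = user_facing_records(sample)
```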
Challenge 3: Authentication with external systems
This challenge is REALLY about: managing sensitive info in cloud services
🔥Disclaimer🔥
● Lots of omitted details
● Focus on the rationale
● Don’t just copy
Principles
● Principle of least privilege
● Minimize the number of services with access to the sensitive info
● Audit everything
Handling Sensitive info
● Rely on cloud provider role-based auth wherever possible
● Secrets are stored in AWS Secrets Manager
● APIs can only create secrets (can’t read)
● Secrets are only retrieved at activation time
● Flink pods don’t have access to the Secrets Manager
○ Secrets are created as k8s secrets and mounted to the pod running the job
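The activation-time flow above can be sketched as follows, in the spirit of the disclaimer (rationale, not something to copy). The `activate` function and the `fetch_secret` callable are hypothetical stand-ins for the one service that may read the secret store. The manifest layout follows the standard Kubernetes `Secret` resource, whose `data` values are base64-encoded.

```python
import base64


def k8s_secret_manifest(name: str, namespace: str, values: dict[str, str]) -> dict:
    """Build a Kubernetes Secret manifest; `data` values must be base64-encoded."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {
            key: base64.b64encode(value.encode("utf-8")).decode("ascii")
            for key, value in values.items()
        },
    }


def activate(job_name: str, namespace: str, fetch_secret) -> dict:
    """Run by the single service allowed to read the secret store.

    `fetch_secret` stands in for a secret-store lookup (hypothetical callable).
    The Flink pod never calls it; it only sees the mounted Kubernetes Secret.
    """
    values = fetch_secret(job_name)
    return k8s_secret_manifest(f"{job_name}-credentials", namespace, values)


# Example with a fake lookup; the names and the value are placeholders.
manifest = activate("job-a", "decodable-cell-1-data", lambda _job: {"password": "s3cret"})
```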
Rationale
● Storing sensitive information together with other configurations (RDS) is bad
○ Many services have access to RDS
○ Hard to enforce more granular access control
○ Increases the risk profile of RDS
● Only 1 service has access to the Secrets Manager
Parting Thoughts
To run Flink in a multi-tenant environment, you need deep knowledge about:
● Cloud infra/architecture
● Networking
● Operating Systems
● Distributed Systems
● …
It’s just hard!
2022
Build real-time data apps &
services. Fast.
decodable.co

Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 

The top 3 challenges running multi-tenant Flink at scale

  • 1. The top 3 challenges running multi-tenant Flink at scale
    Sharon Xie (sharon@decodable.co), Founding Engineer at Decodable
  • 2. Quick Intro of Decodable
    Mental Model
    ● Connections to external data systems
    ● Streams of data records
    ● Pipelines that process data in streams
  • 3. Flink at Decodable
    ● Runs the connection and pipeline jobs
    ● Not directly exposed to the users
    ● Mixed deployment modes for different use cases
    Why Flink?
    ● Purpose-built for stream processing
    ● Proven to scale in production
    ● Mature community
  • 4. Cool, now the challenges start
    The paradox of choice
    ● Massive Flink configurations
    ● Different APIs, deployment modes
    Cloud resource sharing is great but …
    ● Noisy neighbors
    ● Blast radius
    ● Security
    ● Observability
  • 5. Challenge 1: Infra resource management
    Problems
    ● Isolation vs. resource sharing
    ● Cost 💸
    ● Developer productivity
    Design principles
    ● Use managed services
    ● Start with max cost efficiency
    ● Easy to support max isolation
  • 6. Start with max cost efficiency
    [Diagram: Cell 1 and Cell 2 share one VPC (subnets, SGs, ...), one K8s cluster (EKS: cluster + nodes + ...), one DB cluster (RDS Aurora cluster + R/W instances), and one Kafka cluster (MSK, AWS Managed Kafka). Each cell has its own K8s namespaces (decodable-cell-1-control and decodable-cell-1-data; decodable-cell-2-control and decodable-cell-2-data), and topics are shown for each cell in the Kafka cluster.]
  • 7. For max isolation deployment
    [Diagram: a single cell with its own VPC, K8s cluster, DB cluster, and Kafka cluster.]
  • 8. Managed by Terraform
    module cell_1 {
      cluster = module.clusters.cluster_1
      database = module.databases.database_1
      kafka = module.kafkas.kafka_1
    }
    module cluster_1 { network = module.vpcs.1 }
    module database_1 { network = module.vpcs.1 }
    module kafka_1 { network = module.vpcs.1 }
    ● Centralized infrastructure management
    ● Easy to provision a new cell with reusable infra code
  • 9. Challenge 2: Observability
    Information Overload
    ● Each Flink job has a lot of metrics
      ○ Operator level, task level, job level, JobManager, TaskManagers
      ○ Now imagine we have a lot of jobs…
    ● Flink reports a lot of errors; they are useful for debugging but need classification
      ○ Connectivity issues with external systems
      ○ Internal temporary errors
      ○ Internal errors that require someone to fix (e.g., code errors)
  • 10. Internal Observability
    Goals
    ● Optimize for on-call engineers’ QoL
    ● Actionable operational insights
    What we did:
    ● Reduce the likelihood of configuration error
      ○ Flink pods have the same configuration
      ○ Run Flink SQL jobs
    ● Key metrics / events to monitor
      ○ Successful completion of checkpoints
      ○ Time from activation until all tasks are running
      ○ Job failures / restarts
    ● Heavy use of Kubernetes tagging
  • 11. User-facing Observability
    Goals:
    ● Eliminate noise while knowing what’s going on
    ● Actionable error messages
    ● Integrate with users’ monitoring tools
    What we did
    ● Expose job-level metrics
    ● Expose runtime errors that require user action
      ○ Error message classifier / processor
      ○ Custom metrics reporter to Kafka
    ● Streams for audit events & job metrics
  • 12. Challenge 3: Authentication with external systems
    This challenge is REALLY about: managing sensitive info in cloud services
    🔥Disclaimer🔥
    ● Lots of omitted details
    ● Focus on the rationale
    ● Don’t just copy
    Principles
    ● Principle of least privilege
    ● Minimize the # of services with access to the sensitive info
    ● Audit everything
  • 13. Handling Sensitive info
    ● Rely on cloud provider role-based auth wherever possible
    ● Secrets are stored in AWS Secrets Manager
    ● APIs can only create secrets (can’t read)
    ● Secrets are only retrieved at activation time
    ● Flink pods don’t have access to the Secrets Manager
      ○ Secrets are created as k8s secrets and mounted to the pod running the job
  • 14. Rationale
    ● Storing sensitive information together with other configurations (RDS) is bad
      ○ Many services have access to RDS
      ○ Hard to enforce more granular access control
      ○ Increases the risk profile of RDS
    ● Only 1 service has access to the Secrets Manager
  • 15. Parting Thoughts
    To run Flink in a multi-tenant environment, you need deep knowledge about:
    ● Cloud infra/architecture
    ● Networking
    ● Operating Systems
    ● Distributed Systems
    ● …
    It’s just hard!
  • 16. 2022
    Build real-time data apps & services. Fast.
    decodable.co
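
The error classification described on slides 9 and 11 can be pictured as a small rule-based classifier. The sketch below is only an illustration, not Decodable's implementation: the three categories come from slide 9, while the regex patterns, the names `RULES`, `classify`, and `is_user_actionable`, the choice to treat unmatched errors as needing a fix, and the choice to surface only connectivity errors to users are all assumptions made for this example.

```python
import re
from enum import Enum


class ErrorCategory(Enum):
    # The three categories listed on slide 9.
    EXTERNAL_CONNECTIVITY = "connectivity issue with an external system"
    INTERNAL_TRANSIENT = "internal temporary error"
    INTERNAL_NEEDS_FIX = "internal error that requires someone to fix"


# Hypothetical patterns (assumptions, not taken from the talk).
# Rules are checked in order and the first match wins.
RULES = [
    (re.compile(r"connection refused|timed out|unknownhostexception", re.I),
     ErrorCategory.EXTERNAL_CONNECTIVITY),
    (re.compile(r"checkpoint expired|heartbeat", re.I),
     ErrorCategory.INTERNAL_TRANSIENT),
]


def classify(message: str) -> ErrorCategory:
    """Map a raw error message to a category."""
    for pattern, category in RULES:
        if pattern.search(message):
            return category
    # Assumption: anything unrecognized is escalated to a human.
    return ErrorCategory.INTERNAL_NEEDS_FIX


def is_user_actionable(category: ErrorCategory) -> bool:
    """Assumption: only external connectivity errors are shown to users."""
    return category is ErrorCategory.EXTERNAL_CONNECTIVITY


if __name__ == "__main__":
    for msg in (
        "java.net.ConnectException: Connection refused",
        "Checkpoint expired before completing",
        "java.lang.NullPointerException",
    ):
        category = classify(msg)
        print(f"{msg!r}: {category.name}, expose to user: {is_user_actionable(category)}")
```

Defaulting unmatched messages to the "needs fix" category is a deliberately conservative choice in this sketch: an unknown error reaches an engineer instead of being silently treated as transient.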

Editor's Notes

  1. What else? User management / access control; multi-tenant resource management.
  2. No UDFs. SQL operators’ behavior is known and predictable. Stress the idea of knowing the behavior of these operators on most workloads, to make it clear we're not just YOLO'ing our config.
  3. The API service has the most privilege, as it can R/W secrets. Still trapped in the cell when 💩 happens. Heavily audit activities in the API service. Easily remove the service account’s assumed role when bad things happen.