Kubernetes Clusters At Scale
Managing Hundreds of Apache Pinot Kubernetes Clusters
Inside Each End User’s Own Cloud Infrastructure
Xiaoman Dong
DevOps, Software Engineer @StarTree
Apache Pinot
• OLAP Datastore
• Columnar, Indexed storage
• Real-time, low-latency analytics
• Distributed – highly available, reliable, scalable
• Lambda architecture
• SQL Interface
• Open Source - Apache TLP
Typical/Traditional SaaS
● K8S Owned by SaaS Company
● Data Stays in the SaaS Company’s Virtual Private Cloud
We Do Delegated Management Solution
● K8S Owned by Customer
● Data Stays inside the Customer’s Virtual Private Cloud
● Fully Managed by Us
Design Context throughout the Talk
The 3 Major Constraints
● Cloud Boundaries
● Optimized for Apache Pinot
● Scale to hundreds or more
We will focus on how these 3 make our system special
How do we design such a system?
(My job is safe from ChatGPT ... for now)
The journey: design such a system
• We are going to start small, automate, and dive deeper
• Always think about our context: customer’s cloud, our backend
Step 1: Creating the Clusters
• Each customer will be able to create and see their own clusters
• Self-serve provisioning via UI
• Multi-cloud support (AWS, GCP, Azure)
Step 1: Provisioning
The Manual Way
Automate this!
● Log into the AWS console with credentials provided by the customer
● Create Account, Networking, Kubernetes Cluster
● ❌ Bash scripts around aws eks cluster creation
● ✅ Write your own microservice
- Use the AWS client libraries
- Terraform
Step 1: Provisioning (Cont’d)
- How do we scale to 1k customers?
Step 1: Provisioning - Orchestration
Orchestration Engine
Workflow Needed:
1. Create Account
2. Create Network
3. Create NodeGroup
4. Create K8S
5. Create …
6. Notify Finished
Retry in each step, report status
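The workflow above can be sketched as a small orchestration loop: ordered steps, each retried a few times, with status reported after every attempt. This is an illustrative sketch only; the step names and the `report_status` callback are assumptions, not the actual orchestration engine's API.

```python
import time

def run_workflow(steps, report_status, max_retries=3, backoff_s=0):
    """Run each (name, fn) step in order; retry failed steps, report status."""
    for name, fn in steps:
        for attempt in range(1, max_retries + 1):
            try:
                fn()
                report_status(name, "done")
                break
            except Exception as exc:
                report_status(name, f"attempt {attempt} failed: {exc}")
                if attempt == max_retries:
                    report_status(name, "failed")
                    return False
                time.sleep(backoff_s)  # simple fixed backoff between retries
    report_status("workflow", "finished")
    return True
```

Per-step retry with status reporting is what lets one engine drive hundreds of provisioning runs: a transient cloud API error retries in place instead of failing the whole workflow.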
Step 2: Installing Applications
Goal: Customers need to access their clusters with Pinot running
Step 2: Installing Applications
The Manual Way
Automate This
● ❌ kubectl apply -f all-apps.yaml
● ✅ helm upgrade --install startree-platform …
● Build our own helm charts
● Run our own private helm repo (or pay for AWS ECR)
● All applications deployed via Helm Chart
● Call helm libraries in our code
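A minimal sketch of how a microservice could assemble the idempotent install-or-upgrade invocation per customer cluster. The release, chart, and context names here are hypothetical placeholders, not the real deployment values.

```python
def helm_upgrade_cmd(release, chart, version, values_file, kube_context):
    """Return argv for `helm upgrade --install`, pinned to one version."""
    return [
        "helm", "upgrade", "--install", release, chart,
        "--version", version,          # keep chart version == release version
        "--values", values_file,       # per-customer customizations
        "--kube-context", kube_context,
        "--atomic",                    # auto-rollback if the upgrade fails
        "--wait",
    ]
```

Because `upgrade --install` is idempotent, first-time installation and later upgrades share one code path, which matters when the same pipeline drives hundreds of clusters.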
K8S Cluster Runs as a Platform, Applications are Pluggable
Charts and Docker images owned by separate teams 😍
Step 3: Networking
A huge topic worth a dedicated session
Public facing vs. “Internal” facing (VPC Peering)
Kubernetes Has Good Network Modeling and Ecosystem
● Ingress - We choose Traefik, easy for teams to define ingress
● LoadBalancer by Each Cloud Provider
● Extra VPC Peering on demand
● Multi-Zone High Availability
Step 4: TLS and Certificates - Problem
Secure connection is required nearly everywhere
● Even within a VPC/firewall, customers request it
● Manual certificate generation will not scale
Certificates have expiration dates
● Automated renewal is needed
● First Time Creation == Future Renewal
Step 4: TLS and Certificates - Knowledge
Facts of Certificates
- Proves that you properly own a DNS name
- To generate a certificate, we must complete a DNS-based challenge to prove ownership
- Established by a chain of trust
- Issued by well-known/pre-installed 3rd-party issuers like ZeroSSL
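The "First Time Creation == Future Renewal" idea can be made concrete with a single decision function: one code path that only asks whether a certificate is missing or close to expiry. A minimal sketch, assuming a 30-day renewal window (an illustrative choice, not the actual production setting):

```python
from datetime import datetime, timedelta

def needs_issue(not_after, now, renew_before=timedelta(days=30)):
    """True if no certificate exists yet, or it expires inside the window."""
    if not_after is None:                     # first-time creation
        return True
    return not_after - now <= renew_before    # future renewal
```

Running this check on a schedule means creation and renewal are the same operation, so the renewal path is exercised from day one instead of failing silently months later.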
Step 4: TLS and Certificates: Centralized
Option 1: Centralized solution
✅ Better Security
❌ Harder to Scale
Step 4: TLS and Certificates: Distributed
Decentralized Certificate Renewal
❌ Less Secure
✅ Easier to Scale Up
Special Part for Delegated Management Solution
Step 5,6,7…
The Usual DevOps stuff
● OIDC for AuthZ/AuthN
● Prometheus + AlertManager for Observability
● Logging, Debugging
● Backup and Disaster Recovery
● Metrics push to centralized monitoring and/or customer’s metrics storage
● Backup to customer’s deep store
Checkpoint 1: Kubernetes Fleet Management
Architecture So Far: a mini version of a multi-cloud Kubernetes fleet management system, like KubeSphere
Wait, What About Apache Pinot?
Pinot Kubernetes Operator
Configuration/Customization
Templated Environment Creation
● Some customers like to enable Groovy in queries, some don't
● Customizations/configurations are applied onto templates
● Customizations are applied like a Visitor pattern from the classic Design Patterns
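The visitor-style customization above can be sketched as a shared environment template plus per-customer "visitors" that each adjust one concern. All keys and helper names here are hypothetical placeholders, not real Pinot configuration:

```python
import copy

BASE_TEMPLATE = {
    "pinot": {"queryConfig": {"allowGroovy": False}},
    "replicas": {"server": 3},
}

def enable_groovy(template):
    template["pinot"]["queryConfig"]["allowGroovy"] = True

def set_server_replicas(n):
    def visit(template):
        template["replicas"]["server"] = n
    return visit

def render(customizations):
    """Apply each customization to a fresh copy of the base template."""
    template = copy.deepcopy(BASE_TEMPLATE)
    for visit in customizations:
        visit(template)
    return template
```

Each visitor touches only its own concern, so per-customer differences stay composable while the base template remains the single source of defaults.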
Are we there yet?
“Ops” part of DevOps!
* Image courtesy https://devopedia.org/devops
Version and Upgrades
Version and Upgrades (Cont’d)
The Version Matrix
Lessons Learnt
● Create good release pipeline with tests
● Discipline: avoid releasing versions with
breaking changes
● Keep helm chart and image tag the
same as release version
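The "one version everywhere" rule above can be enforced by a simple release gate: every chart's version and image tag must equal the release version. The manifest shape below is an assumption for the sketch:

```python
def release_gate(release_version, charts):
    """Return the names of charts whose chart version or image tag drifted."""
    return [name for name, c in sorted(charts.items())
            if c["chartVersion"] != release_version
            or c["imageTag"] != release_version]
```

A check like this in the release pipeline turns the versioning discipline into something a machine rejects, rather than something a reviewer has to remember.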
Efficiency and Reliability
Efficiency and Reliability are key to scaling up
● Discipline in DevOps is important
● No architecture is bulletproof
● Fewer Outages == Better Efficiency
● DevOps exists for end-to-end ownership
Efficiency and Reliability - Cont’d
Best Practices
● Build Good Infra Integration/Regression Test
● Trunk-Based Release Pipelines
○ Always release from master
○ Say no to release branches
● Do not customize via kubectl commands
Operations and OnCalls
There is no silver bullet for OnCall
• Discipline and Process
- Root Cause every outage
- Follow up on every outage
• Effective Alerts
- Differentiate alerts from signals
- Review and Keep Improving
- Build metrics to measure effectiveness
Lessons Learnt
Security design in provisioned clusters is hard
• Centralized Control, less Scalability
• Decentralized Control, harder to protect credentials
• Build good debugging support on TLS certificates
Do not run complicated Terraform configurations
• Complicated state leads to bugs and unwanted resource recreation
• Terraform's internal state is hard to keep track of
Lessons Learnt (cont’d)
A Certificate Issuer like ZeroSSL may partially go down for half a day
• No new customer can onboard during that downtime
A 3rd-party Helm repo going down blocks customer cluster upgrades
• Serve Helm Charts from your own repo, like JFrog Artifactory
What’s Ahead
• Improving Design For Layering
• Improve Resource Efficiency
• No Downtime Upgrade
• Cluster Federation
• …
Questions?
Thank you!
Reach me via https://www.linkedin.com/in/xiaoman/
