QuestDB is a high-performance open source database. Many people told us they would like to use it as a service, without having to manage the machines. So we got to work on a solution that would let us launch QuestDB instances with fully managed provisioning, monitoring, security, and upgrades.
A few Kubernetes clusters later, we managed to launch our QuestDB Cloud offering. This talk is the story of how we got there. I will cover tools such as Calico, Karpenter, CoreDNS, Telegraf, Prometheus, Loki, and Grafana, but also challenges like authentication, billing, and multi-cloud, and what you have to say no to in order to survive in the cloud.
5. Open Source Business Models
● Professional Services/Support
● Open Core
● Software as a Service
https://en.wikipedia.org/wiki/Business_models_for_open-source_software
6. Why we chose to launch a managed service
QuestDB is simple to operate, but…
… some companies ingest several thousand events per second (some
of them up to 200 thousand per second)
… and expect predictable performance
… while running queries on top.
That’s not so simple anymore.
Also, some teams simply prefer not to manage any infrastructure at all.
8. QuestDB from a DevOps perspective
● Single running process. Multiple interfaces
● Port 9000 for web interface and REST API
● Port 9003 for health check and Prometheus metrics
● Port 9009 for ILP (fast ingestion, socket based)
● Port 8812 for PostgreSQL (pgwire) protocol
● conf, log, and db folders
○ db folder can be distributed across primary + secondary disk
● Commands for backup and data partitions lifecycle
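Since ILP on port 9009 is newline-delimited text over a plain socket, ingestion can be sketched in a few lines of Python. The table and field names below are made up, and in QuestDB Cloud this endpoint sits behind HAProxy with TLS, so a real client would wrap the socket accordingly.

```python
import socket
import time

def ilp_line(table, symbols, columns, ts_ns):
    # Build one InfluxDB-line-protocol row, the format QuestDB accepts on 9009:
    #   table,sym1=a,sym2=b field1=1.5,field2=2 <nanosecond timestamp>
    syms = ",".join(f"{k}={v}" for k, v in symbols.items())
    cols = ",".join(f"{k}={v}" for k, v in columns.items())
    return f"{table},{syms} {cols} {ts_ns}\n"

def send_lines(lines, host="localhost", port=9009):
    # Plain TCP send; QuestDB parses ILP rows as they arrive on the socket.
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode())

row = ilp_line("trades", {"symbol": "BTC-USD"},
               {"price": 66000.5, "amount": 0.25}, int(time.time() * 1e9))
# send_lines([row])  # uncomment against a running instance
```

The socket-based protocol is what makes ILP the fast ingestion path compared with the HTTP round-trips of the REST API.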
9. The basics of a managed data service
● No operations (almost)
● Security everywhere
● Sensible defaults (depending on instance size)
● Access to config tuning
● Specialized support
● Management panel
● Backend admin website
● Monitoring dashboard and alerting
● Simple and quick provisioning/deprovisioning
● Flexible sizing
● Managed upgrades
● Managed snapshots
● Multi regional availability
● Billing and payments
● User management/Single sign on
● Choice of different cloud providers
● Bring your own cloud/on-premises deployment
11. You can always launch with less
We decided to launch a minimal, invitation-only private beta.
We invited a new customer roughly every two weeks, with fully
functional QuestDB instances, but with some parts of the cloud
experience still under development.
12. The basics of a managed data service*
● No operations (almost)
● Security everywhere
● Sensible defaults (depending on instance size)
● Access to config tuning
● Specialized support
● Management panel
● Backend admin website
● Monitoring dashboard and alerting
● Simple and quick provisioning/deprovisioning
● Flexible sizing
● Managed upgrades
● Managed snapshots
● Multi regional availability
● Billing and payments
● User management/Single sign on
● Choice of different cloud providers
● Bring your own cloud/on-premises deployment
* Everything except the parts in orange is already publicly available
13. Recognise what not to build yourselves
● Hardware and low-level services (other than QuestDB) => AWS
● Metering usage of services => Metronome
● User authentication, single sign on => Auth0
● User payments => Stripe
● Global taxes => Anrok
14. The main/provisioner K8s cluster
● A single cluster (no multi region)
● Amazon ELB + Nginx for the cloud web interface
● Monitoring dashboards (powered by QuestDB) for all instances
● General user and instance management, stored in HA Amazon RDS
PostgreSQL
● Handles provisioning of all the user instances, using Kafka as event broker
● Autoscaling is done with Karpenter
15. A tenant cluster per region
● System nodes
○ Shared resources (metric aggregation, alerting, certificate manager, AWS
Secrets, SSL termination, networking...)
○ Controlled by AWS. Autoscaled with Karpenter
● Instance nodes, added to the cluster but not controlled by AWS
○ Kubernetes operator (under development) for improved life cycle
○ Isolated with namespaces, k8s policies, and AWS policies
○ Runs QuestDB, collects metrics via Vector.dev and logs via Loki
○ Executes scheduled/ad-hoc incremental snapshots
○ Non-shared EBS (gp3) volume
16. [Architecture diagram] The main/provisioner K8s cluster (PostgreSQL, Kafka) connects to one tenant cluster per region (Regions A, B, C). Each tenant cluster contains system nodes (Loki, Prometheus agent) and instance (customer) nodes running QuestDB, with Vector.dev shipping metrics and logs, and snapshots taken per instance.
22. Creative cloud disk management
Local NVMe drives are fast, but they don’t survive instance restarts.
Option A. Write in parallel into two instances, one with a local disk,
one with EBS. Read from the one with the local disk. Also helps with HA.
Option B. RAID 1 with a local and an EBS disk. Reads always come from
the local drive; writes go to both, so writes are still slow, but reads
are very fast.
We forked aws-ebs-csi-driver to work around disk issues specific to our
instances.
23. SSL/TLS everywhere
● QuestDB console and REST API (HTTP)
○ Nginx
● Pgwire protocol (TCP/IP)
○ PgBouncer (considering Envoy with its Postgres module)
● ILP protocol (TCP socket)
○ HAProxy
27. Proxy and SSL/TLS challenges
The proxies added a bit of latency, indiscernible for most use cases,
and fine-tuned for QuestDB’s typical data ingestion patterns.
Large imports using the REST API would break because Nginx tried to
hold the whole file in memory.
HAProxy by default starts one thread per CPU, even when running in K8s.
We had performance issues until we noticed and configured it accordingly.
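A minimal haproxy.cfg fragment for the thread issue above; the thread count here is an example value, to be sized to the pod’s CPU limit rather than the node’s CPU count:

```
global
    # HAProxy defaults to one thread per detected CPU; inside a K8s pod
    # that reflects the node's cores, not the container's CPU limit,
    # so pin the count explicitly.
    nbthread 2
```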
28. Provisioning: three levels
● The main cluster (single one, but needed for dev environments)
● The tenant clusters (one for every supported region)
● The QuestDB customer instances
29. Provisioning the main cluster
● Terraform
○ Amazon EKS + EBS volumes
○ Amazon RDS
○ Amazon Managed Kafka
30. Provisioning the tenant/region clusters
● Terraform
○ Amazon EKS + EBS volumes
○ Manual configuration to add the
cluster on Argo CD for automatic
upgrades (only for production
clusters)
31. Provisioning the customer instances
● Customer initiates state change on control panel (running on main cluster)
○ Backend sends JSON message to Kafka
○ Backend running on tenant cluster receives Kafka message
■ Backend initiates change. (Via K8s operator soon)
■ Instance is restarted (if needed)
■ Backend sends event to mark change finished
■ Front-end control panel displays status change (if needed)
■ Backend on the main cluster records the change in PostgreSQL (if needed)
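As a sketch of the flow above, the JSON event the control-panel backend publishes could look like the following; the field names and the "provisioning" topic are illustrative, not QuestDB Cloud's actual schema.

```python
import json
import uuid

def build_provision_event(instance_id, action, payload=None):
    # Illustrative event shape: the real QuestDB Cloud message format is not public.
    return json.dumps({
        "event_id": str(uuid.uuid4()),   # lets the tenant backend deduplicate
        "instance_id": instance_id,
        "action": action,                # e.g. "create", "resize", "upgrade", "delete"
        "payload": payload or {},
    })

event = build_provision_event("qdb-123", "resize", {"cpu": 8, "memory_gb": 32})
# A producer on the main cluster would then publish it, e.g. with kafka-python:
# KafkaProducer(bootstrap_servers="...").send("provisioning", event.encode())
```

Carrying a unique event id makes the tenant-side consumer idempotent, which matters because Kafka delivery is at-least-once.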
32. Upgrading the clusters
● Happens when
○ New version of control panel
○ New version of cloud backend
○ New version for dependencies
○ New QuestDB release
● Managed via
○ Argo CD listening to GitHub
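For reference, a hypothetical Argo CD Application pointing a tenant cluster at a Git path; the name, repo URL, path, and namespace below are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: tenant-backend            # placeholder name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/cloud-deploys   # placeholder repo
    path: tenant/backend
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: backend
  syncPolicy:
    automated:                    # sync on every new commit
      prune: true
      selfHeal: true
```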
33. Upgrading QuestDB
● Once a new Enterprise (or legacy OSS) release is available, Argo CD makes it
available from the control panel, both for new instances and as an optional
upgrade for existing ones
● Both instance creation and instance changes (including upgrades, changes on
instance or disk sizing, changes in running state, deletions, or configuration
changes) are done via the provisioning mechanism initiated with a Kafka message
34. Managed snapshots
● Manual snapshots are provisioned (using Kafka) and executed as a short-lived pod
on the customer instance. Storage taken by manual snapshots is billed separately
● Automatic scheduled snapshots are also provisioned using Kafka, and executed as
long-lived pods that wake up on schedule. Scheduled snapshots can be paused or
deleted, in which case the pods are paused or deleted too. The schedule is an
hourly range, and we can initiate at any point during the hour to limit concurrency.
The last 7 days of scheduled snapshots are included in the instance base price
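The "initiate at any point during the hour" trick can be sketched as a small jitter function; the function name and interface are illustrative:

```python
import random
from datetime import datetime, timedelta, timezone

def snapshot_start(window_hour, now=None):
    # Given the start of a customer's hourly window (e.g. 3 for 03:00-04:00 UTC),
    # pick a random second inside that hour so concurrent snapshots spread out
    # instead of all starting at the top of the hour.
    now = now or datetime.now(timezone.utc)
    window = now.replace(hour=window_hour, minute=0, second=0, microsecond=0)
    return window + timedelta(seconds=random.randrange(3600))

start = snapshot_start(3)
```

Spreading the start times keeps EBS and snapshot-API load flat instead of spiky across a fleet of instances that all picked the same window.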
35. Monitoring and Observability
QuestDB exposes Prometheus metrics; Vector.dev exports them.
We use Grafana for internal observability of every node and instance.
We ingest selected metrics into QuestDB to power customer dashboards (uPlot).
Logs are exported via Loki. We also exported them to S3 initially, but we are removing that.
The different K8s clusters cross-monitor each other.
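Because both the health check and the metrics live on port 9003, a scraper only needs HTTP plus a small reader for the Prometheus text format. A minimal sketch, where the metric names are illustrative rather than QuestDB's exact metric set:

```python
def parse_prometheus_text(text):
    # Minimal reader for the Prometheus text exposition format: skip comments
    # and blank lines, then split "name value" pairs (label-free lines only).
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

# Shape of what the metrics port (9003) returns; metric names are illustrative.
sample = """\
# HELP questdb_json_queries_total Number of JSON queries
# TYPE questdb_json_queries_total counter
questdb_json_queries_total 42
questdb_memory_tag_mmap 1048576
"""
metrics = parse_prometheus_text(sample)
```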
38. Operating the QuestDB cloud
● Almost everything is automated
● When there is an alert, it is sent via PagerDuty and Slack
○ On-call/team rotas
○ Playbooks for: provisioner, backend, frontend, ingress, backup,
HAProxy, PgBouncer, and QuestDB
● Admin tasks are still very much manual, with some templates for
querying Grafana and PostgreSQL for common tasks
39. The (fully remote) team
● 1.5 front-enders
● 1.5 backenders (the other 0.5 from above)
● 2 infrastructure/devops
40. The (fully remote) extended team
● Core team (CTO + 8 devs) working on some enterprise/cloud features and
adapting core when needed. Part of the on-call
● CEO, and COO for legal, pricing, and business matters
● Tech writer, for docs
● Developer Advocate, for developer experience, feedback, and demos
● Head of Talent, for putting the team together
41. Some upcoming features for cloud
● Compression (actually, just released last week)
● Quickstart tutorial when launching a new instance
● Cold storage, moving data automatically to S3
● Role-Based Access Control
● Horizontal scaling for reads (coming to QuestDB OSS as well)
● Horizontal scaling for writes
● Configurable alerting
● More single sign on choices
● VPC peering
● SOC2 compliance
● Adding new regions
● Azure support