How we turned an open source DB
into a multi-tenant SaaS using K8s
Javier Ramirez
@supercoco9
We would like to be known for
● Performance
○ Better performance with smaller machines
● Developer Experience
● Proudly Open Source (Apache 2.0)
Apache 2.0 License
Open Source Business Models
● Professional Services/Support
● Open Core
● Software as a Service
https://en.wikipedia.org/wiki/Business_models_for_open-source_software
Why we chose to launch
a managed service
QuestDB is simple to operate, but…
… Some companies ingest several thousand events per second (some
of them up to 200 thousand per second)
… And expect predictable performance
… While running queries on top.
That’s not that simple anymore.
Also, some teams just prefer not to manage any infrastructure at all.
QuestDB from a DevOps perspective
● Single running process. Multiple interfaces
● Port 9000 for web interface and REST API
● Port 9003 for health check and Prometheus metrics
● Port 9009 for ILP (fast ingestion, socket based)
● Port 8812 for PostgreSQL (pgwire) protocol
● conf, log, and db folders
○ db folder can be distributed across primary + secondary disk
● Commands for backup and data partitions lifecycle
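As a sketch of what those interfaces look like from client code (assuming nothing about the real SaaS codebase), here is the port map from the slide above plus a hand-built ILP (InfluxDB Line Protocol) line of the kind sent to port 9009; table and column names are illustrative:

```python
# Sketch only: QuestDB's default ports, and a minimal ILP line.
PORTS = {
    9000: "web console + REST API (HTTP)",
    9003: "health check + Prometheus metrics",
    9009: "ILP ingestion (TCP socket)",
    8812: "PostgreSQL wire (pgwire) protocol",
}

def ilp_line(table: str, symbols: dict, columns: dict, ts_ns: int) -> str:
    """Build one ILP line: table,tag=val field=val timestamp_ns + newline."""
    tags = ",".join(f"{k}={v}" for k, v in symbols.items())
    fields = ",".join(f"{k}={v}" for k, v in columns.items())
    return f"{table},{tags} {fields} {ts_ns}\n"

line = ilp_line("trades", {"symbol": "BTC-USD"}, {"price": 66000.5},
                1700000000000000000)
# would be sent over a plain TCP socket to port 9009
```

In practice you would use an official QuestDB client library rather than hand-rolling the protocol; this only illustrates the wire format.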
The basics of a managed data service
● No operations (almost)
● Security everywhere
● Sensible defaults (depending on instance size)
● Access to config tuning
● Specialized support
● Management panel
● Backend admin website
● Monitoring dashboard and alerting
● Simple and quick provisioning/deprovisioning
● Flexible sizing
● Managed upgrades
● Managed snapshots
● Multi regional availability
● Billing and payments
● User management/Single sign on
● Choice of different cloud providers
● Bring your own cloud/on-premises deployment
You can always
launch with less
We decided to launch a minimal, invitation-only private beta.
We invited a new customer every ~2 weeks, with fully functional QuestDB
instances, but with some parts of the cloud experience still under
development.
The basics of a managed data service*
● No operations (almost)
● Security everywhere
● Sensible defaults (depending on instance size)
● Access to config tuning
● Specialized support
● Management panel
● Backend admin website
● Monitoring dashboard and alerting
● Simple and quick provisioning/deprovisioning
● Flexible sizing
● Managed upgrades
● Managed snapshots
● Multi regional availability
● Billing and payments
● User management/Single sign on
● Choice of different cloud providers
● Bring your own cloud/on-premises deployment
* Everything except the parts shown in orange on the slide is
already publicly available
Realise what you are not going to do
● Hardware and low-level services (other than QuestDB) => AWS
● Metering usage of services => Metronome
● User authentication, single sign on => Auth0
● User payments => Stripe
● Global taxes => Anrok
The main/provisioner K8s cluster
● A single cluster (no multi region)
● Amazon ELB + Nginx for the cloud web interface
● Monitoring dashboards (powered by QuestDB) for all instances
● General user and instance management. Stored in HA Amazon RDS
PostgreSQL
● Handles provisioning of all the user instances, using Kafka as event broker
● Autoscaling is done with Karpenter
A tenant cluster per region
● System nodes
○ Shared resources (metric aggregation, alerting, certificate manager, AWS
Secrets, SSL termination, networking...)
○ Controlled by AWS. Autoscaled with Karpenter
● Instance nodes, added to the cluster but not controlled by AWS
○ Kubernetes operator (under development) for improved life cycle
○ Isolated with namespaces, k8s policies, and AWS policies
○ Runs QuestDB, collects metrics via Vector.dev and logs via Loki
○ Executes scheduled/ad-hoc incremental snapshots
○ Non-shared EBS (gp3) volume
[Architecture diagram: the main/provisioner K8s cluster (PostgreSQL, Kafka,
QuestDB) connects to one tenant cluster per region (Regions A, B, C). Each
tenant cluster has system nodes and instance (customer) nodes running QuestDB,
Vector.dev, Loki, a Prometheus agent, and snapshot jobs]
EBS/cloud disks are “slow”
https://demo.questdb.io
https://github.com/javier/questdb-quickstart
Creative cloud disk
management
Local NVMe drives are fast.
But they don’t survive instance restarts.
Option A. Write in parallel into two instances, one with local, one with
EBS. Read from the one with local disk. Also helps with HA.
Option B. RAID 1 with a local and EBS disk. Write and read always
from local drive. Writes still slow, but very fast reads.
Forked aws-ebs-csi-driver for specific disk issues on our instances.
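Option A can be sketched with two in-memory stand-ins for the replicas; this is a toy model of the idea, not the actual implementation:

```python
class DualWriteStore:
    """Option A sketch: every write goes to both a fast local-NVMe replica
    and a durable EBS-backed replica; reads come from the fast one.
    Dicts stand in for the two QuestDB instances."""

    def __init__(self):
        self.local_nvme = {}  # fast, but lost on instance restart
        self.ebs = {}         # slower, but survives restarts

    def write(self, key, value):
        # Write in parallel to both instances
        self.local_nvme[key] = value
        self.ebs[key] = value

    def read(self, key):
        # Always read from the fast local-disk instance
        return self.local_nvme[key]

    def restart(self):
        # Local NVMe does not survive a restart;
        # repopulate it from the durable copy
        self.local_nvme = dict(self.ebs)
```

Besides speed, having two live replicas is what also helps with HA: either copy can serve reads while the other recovers.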
SSL/TLS everywhere
● QuestDB console and REST API (HTTP)
○ Nginx
● Pgwire protocol (TCP/IP)
○ PgBouncer (considering envoy with pg module)
● ILP protocol (TCP socket)
○ HAProxy
Proxy and SSL/TLS
challenges
Added a bit of latency. Indiscernible for most use cases. Fine-tuned
for QuestDB's typical data ingestion patterns.
Large imports using the REST API would break as Nginx tries to hold
the whole file in memory.
HAProxy by default starts a thread per CPU, even when in K8s. Had
performance issues until we noticed and configured accordingly.
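The HAProxy fix boils down to deriving the thread count from the pod's cgroup CPU quota instead of the host CPU count. A hedged sketch of that arithmetic (the function name and defaults are ours, not HAProxy's):

```python
def haproxy_nbthread(quota_us: int, period_us: int, host_cpus: int) -> int:
    """HAProxy defaults to one thread per *host* CPU, even inside a K8s pod
    capped by cgroups. Derive a sane `nbthread` from the cgroup CPU quota
    (cfs_quota_us / cfs_period_us); a non-positive quota means "no limit"."""
    if quota_us <= 0:
        return host_cpus
    return max(1, quota_us // period_us)

# A pod limited to 2 CPUs on a 64-CPU node should run 2 threads, not 64:
# haproxy_nbthread(200_000, 100_000, 64) -> 2
```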
Provisioning: three levels
● The main cluster (single one, but needed for dev environments)
● The tenant clusters (one for every supported region)
● The QuestDB customer instances
Provisioning the main cluster
● Terraform
○ Amazon EKS + EBS volumes
○ Amazon RDS
○ Amazon Managed Kafka
Provisioning the tenant/region clusters
● Terraform
○ Amazon EKS + EBS volumes
○ Manual configuration to add the
cluster to Argo CD for automatic
upgrades (only for production
clusters)
Provisioning the customer instances
● Customer initiates state change on control panel (running on main cluster)
○ Backend sends JSON message to Kafka
○ Backend running on tenant cluster receives Kafka message
■ Backend initiates change. (Via K8s operator soon)
■ Instance is restarted (if needed)
■ Backend sends event to mark change finished
■ Front-end control panel displays status change (if needed)
■ Backend on main cluster inserts the change into PostgreSQL (if needed)
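The JSON-over-Kafka flow above might carry events shaped like this; the field names are illustrative assumptions, not the real schema:

```python
import json

def provision_event(instance_id: str, action: str, params: dict) -> bytes:
    """Build the JSON payload the main-cluster backend would publish to
    Kafka for a state change. Field names are illustrative."""
    event = {
        "instance_id": instance_id,  # which customer instance to change
        "action": action,            # e.g. "create", "resize", "delete"
        "params": params,            # action-specific details
    }
    return json.dumps(event, sort_keys=True).encode("utf-8")

# The tenant-cluster backend consumes and decodes the same bytes:
msg = provision_event("qdb-123", "resize", {"cpu": 8, "ram_gb": 32})
```

The same message shape works for the reverse direction too: the tenant backend publishes a "change finished" event that the main cluster records.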
Upgrading the clusters
● Happens when
○ New version of control panel
○ New version of cloud backend
○ New version for dependencies
○ New QuestDB release
● Managed via
○ Argo CD listening to GitHub
Upgrading QuestDB
● Once a new enterprise (or OSS, but legacy) release is available, Argo CD will make it
available from the control panel, both for new instances and as an optional
upgrade for existing ones
● Both instance creation and instance changes (including upgrades, changes to
instance or disk sizing, changes in running state, deletions, or configuration
changes) are done via the provisioning mechanism, initiated with a Kafka message
Managed snapshots
● Manual snapshots are provisioned (using Kafka) and executed as a short-lived pod
on the customer instance. Storage taken by manual snapshots is billed separately
● Automatic scheduled snapshots are also provisioned using Kafka, and executed as
long-lived pods that wake up on schedule. Scheduled snapshots can be paused or
deleted, in which case the pods will also be paused or deleted. Scheduling is by an
hourly range, and we can initiate at any point during the hour to limit concurrency.
The last 7 days of scheduled snapshots are included in the instance base price.
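One simple way to spread scheduled snapshots across the hourly range (our sketch, not necessarily the production logic) is to hash each instance id to a deterministic start minute:

```python
import hashlib

def snapshot_start_minute(instance_id: str) -> int:
    """Map an instance id to a stable minute within the hour (0-59),
    so concurrent snapshots are spread out instead of all firing at
    the top of the hour. Deterministic: same id, same minute."""
    digest = hashlib.sha256(instance_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 60
```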
Monitoring and
Observability
QuestDB exposes Prometheus metrics. Vector.dev exports them.
We use Grafana for internal observability of every node and instance.
We ingest selected metrics into QuestDB to power customer dashboards (uplot).
Logs are exported via Loki. They initially also went to S3, but we are removing that.
We cross monitor the different K8s clusters.
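A minimal parser for a Prometheus exposition-format sample of the kind scraped from port 9003 and forwarded by Vector.dev (the metric name in the example is illustrative, and this intentionally skips edge cases like commas inside label values):

```python
def parse_prom_sample(line: str):
    """Parse one Prometheus sample line, e.g.
    'metric_name{label="value"} 42', into (name, labels, value)."""
    name_part, raw_value = line.rsplit(" ", 1)
    labels = {}
    if "{" in name_part:
        name, raw_labels = name_part[:-1].split("{", 1)  # drop trailing '}'
        for pair in raw_labels.split(","):
            key, val = pair.split("=", 1)
            labels[key] = val.strip('"')
    else:
        name = name_part
    return name, labels, float(raw_value)
```

Selected samples like these are what get re-ingested into QuestDB to power the customer-facing dashboards.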
Repositories (Go, Python, Typescript, yaml)
● QuestDB public repositories (questdb + ui)
● Enterprise QuestDB repository, adding
proprietary features
● Saas-infra
● Saas-client
● Saas-client-tests
● Saas-client-mocks
● Saas-client-deployer
● Saas-helm-charts
● Saas-backend
● Saas-exporter
● Questdb-operator
● Saas-operator
● Saas-provisioner
● Aws-ebs-csi-driver (fork)
● Auth0-actionscloud
● saas-admin
Operating the QuestDB cloud
● Almost everything is automated
● When there is an alert, it is sent via PagerDuty and Slack
○ On-call/Team rotas
○ Playbooks for: Provisioner, Backend, Frontend, Ingress, Backup,
HAProxy, PgBouncer, and QuestDB
● Admin tasks are still very much manual, with some templates for
querying Grafana and PostgreSQL for common tasks
The (fully remote) team
● 1.5 front-enders
● 1.5 backenders (the other 0.5 from above)
● 2 infrastructure/devops
The (fully remote) extended team
1.5 front-enders
1.5 backenders (the other 0.5 from above)
2 infrastructure/devops
● Core team (CTO + 8 devs) working on some enterprise/cloud features and
adapting core when needed. Part of the on-call
● CEO, and COO for legal, pricing, and business matters
● Tech writer, for docs
● Developer Advocate, for developer experience, feedback, and demos
● Head of Talent, for putting the team together
Some upcoming features for cloud
● Compression (actually, just released last week)
● Quickstart tutorial when launching a new instance
● Cold storage, moving data automatically to S3
● Role Based Access Control
● Horizontal scaling for reads (coming to QuestDB OSS as well)
● Horizontal scaling for writes
● Configurable alerting
● More single sign on choices
● VPC peering
● SOC2 compliance
● Adding new regions
● Azure support
https://github.com/questdb/questdb
https://questdb.io/cloud/
More info
https://cloud.questdb.com
https://demo.questdb.io
https://github.com/javier/questdb-quickstart
We 💕 contributions and ⭐ stars
github.com/questdb/questdb
THANKS!
Javier Ramirez, Head of Developer Relations at QuestDB, @supercoco9

How we built QuestDB Cloud, a Kubernetes-based SaaS around QuestDB, the open source time series database
