QuestDB is a high-performance open source database. Many people told us they would like to use it as a service, without having to manage the machines. So we got to work on a solution that would let us launch QuestDB instances with fully managed provisioning, monitoring, security, and upgrades.
A few Kubernetes clusters later, we managed to launch our QuestDB Cloud offering. This talk is the story of how we got there. I will cover tools such as Calico, Karpenter, CoreDNS, Telegraf, Prometheus, Loki, and Grafana, but also challenges like authentication, billing, and multi-cloud, and what you have to say no to in order to survive in the cloud.
5. Open Source Business Models
● Professional Services/Support
● Open Core
● Software as a Service
https://en.wikipedia.org/wiki/Business_models_for_open-source_software
6. Why we chose to launch a managed service
QuestDB is simple to operate, but…
… some companies ingest several thousand events per second (some
of them up to 200 thousand per second)
… and expect predictable performance
… while running queries on top.
That’s not so simple anymore.
Also, some teams simply prefer not to manage any infrastructure at all.
8. QuestDB from a DevOps perspective
● Single running process. Multiple interfaces
● Port 9000 for web interface and REST API
● Port 9003 for health check and Prometheus metrics
● Port 9009 for ILP (fast ingestion, socket based)
● Port 8812 for PostgreSQL (pgwire) protocol
● conf, log, and db folders
○ db folder can be distributed across primary + secondary disk
● Commands for backup and data partitions lifecycle
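Since ILP on port 9009 is newline-delimited text over a plain socket, ingestion can be sketched in a few lines of Python. The table and field names below are made up, and in QuestDB Cloud this endpoint sits behind HAProxy with TLS, so a real client would wrap the socket accordingly.

```python
import socket
import time

def ilp_line(table, symbols, columns, ts_ns):
    # Build one InfluxDB-line-protocol row, the format QuestDB accepts on 9009:
    #   table,sym1=a,sym2=b field1=1.5,field2=2 <nanosecond timestamp>
    syms = ",".join(f"{k}={v}" for k, v in symbols.items())
    cols = ",".join(f"{k}={v}" for k, v in columns.items())
    return f"{table},{syms} {cols} {ts_ns}\n"

def send_lines(lines, host="localhost", port=9009):
    # Plain TCP send; QuestDB parses ILP rows as they arrive on the socket.
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode())

row = ilp_line("trades", {"symbol": "BTC-USD"},
               {"price": 66000.5, "amount": 0.25}, int(time.time() * 1e9))
# send_lines([row])  # uncomment against a running instance
```

The socket-based protocol is what makes ILP the fast ingestion path compared with the HTTP round-trips of the REST API.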
9. The basics of a managed data service
● No operations (almost)
● Security everywhere
● Sensible defaults (depending on instance size)
● Access to config tuning
● Specialized support
● Management panel
● Backend admin website
● Monitoring dashboard and alerting
● Simple and quick provisioning/deprovisioning
● Flexible sizing
● Managed upgrades
● Managed snapshots
● Multi regional availability
● Billing and payments
● User management/Single sign on
● Choice of different cloud providers
● Bring your own cloud/on-premises deployment
11. You can always launch with less
We decided to launch a minimal, invitation-only private beta.
We invited a new customer roughly every two weeks, with fully
functional QuestDB instances, but with some parts of the cloud
experience still under development.
12. The basics of a managed data service*
● No operations (almost)
● Security everywhere
● Sensible defaults (depending on instance size)
● Access to config tuning
● Specialized support
● Management panel
● Backend admin website
● Monitoring dashboard and alerting
● Simple and quick provisioning/deprovisioning
● Flexible sizing
● Managed upgrades
● Managed snapshots
● Multi regional availability
● Billing and payments
● User management/Single sign on
● Choice of different cloud providers
● Bring your own cloud/on-premises deployment
* Everything except the parts in orange is already publicly available
13. Recognise what not to build yourselves
● Hardware and low-level services (other than QuestDB) => AWS
● Metering usage of services => Metronome
● User authentication, single sign on => Auth0
● User payments => Stripe
● Global taxes => Anrok
14. The main/provisioner K8s cluster
● A single cluster (no multi region)
● Amazon ELB + Nginx for the cloud web interface
● Monitoring dashboards (powered by QuestDB) for all instances
● General user and instance management, stored in HA Amazon RDS
PostgreSQL
● Handles provisioning of all the user instances, using Kafka as event broker
● Autoscaling is done with Karpenter
15. A tenant cluster per region
● System nodes
○ Shared resources (metric aggregation, alerting, certificate manager, AWS
Secrets, SSL termination, networking...)
○ Controlled by AWS. Autoscaled with Karpenter
● Instance nodes, added to the cluster but not controlled by AWS
○ Kubernetes operator (under development) for improved life cycle
○ Isolated with namespaces, k8s policies, and AWS policies
○ Runs QuestDB, collects metrics via Vector.dev and logs via Loki
○ Executes scheduled/ad-hoc incremental snapshots
○ Non-shared EBS (gp3) volume
16. [Architecture diagram] The main/provisioner K8s cluster (PostgreSQL, Kafka) connects to one tenant cluster per region (Regions A, B, C). Each tenant cluster contains system nodes (Loki, Prometheus agent) and instance (customer) nodes running QuestDB, with Vector.dev shipping metrics and logs, and snapshots taken per instance.
22. Creative cloud disk management
Local NVMe drives are fast, but they don’t survive instance restarts.
Option A. Write in parallel into two instances, one with a local disk,
one with EBS. Read from the one with the local disk. Also helps with HA.
Option B. RAID 1 with a local and an EBS disk. Reads always come from
the local drive; writes go to both, so writes are still slow, but reads
are very fast.
We forked aws-ebs-csi-driver to work around disk issues specific to our
instances.
23. SSL/TLS everywhere
● QuestDB console and REST API (HTTP)
○ Nginx
● Pgwire protocol (TCP/IP)
○ PgBouncer (considering Envoy with its Postgres module)
● ILP protocol (TCP socket)
○ HAProxy
27. Proxy and SSL/TLS challenges
The proxies added a bit of latency, indiscernible for most use cases,
and fine-tuned for QuestDB’s typical data ingestion patterns.
Large imports using the REST API would break because Nginx tried to
hold the whole file in memory.
HAProxy by default starts one thread per CPU, even when running in K8s.
We had performance issues until we noticed and configured it accordingly.
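A minimal haproxy.cfg fragment for the thread issue above; the thread count here is an example value, to be sized to the pod’s CPU limit rather than the node’s CPU count:

```
global
    # HAProxy defaults to one thread per detected CPU; inside a K8s pod
    # that reflects the node's cores, not the container's CPU limit,
    # so pin the count explicitly.
    nbthread 2
```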
28. Provisioning: three levels
● The main cluster (single one, but needed for dev environments)
● The tenant clusters (one for every supported region)
● The QuestDB customer instances
29. Provisioning the main cluster
● Terraform
○ Amazon EKS + EBS volumes
○ Amazon RDS
○ Amazon Managed Kafka
30. Provisioning the tenant/region clusters
● Terraform
○ Amazon EKS + EBS volumes
○ Manual configuration to add the
cluster on Argo CD for automatic
upgrades (only for production
clusters)
31. Provisioning the customer instances
● Customer initiates state change on control panel (running on main cluster)
○ Backend sends JSON message to Kafka
○ Backend running on tenant cluster receives Kafka message
■ Backend initiates change. (Via K8s operator soon)
■ Instance is restarted (if needed)
■ Backend sends event to mark change finished
■ Front-end control panel displays status change (if needed)
■ Backend on the main cluster records the change in PostgreSQL (if needed)
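As a sketch of the flow above, the JSON event the control-panel backend publishes could look like the following; the field names and the "provisioning" topic are illustrative, not QuestDB Cloud's actual schema.

```python
import json
import uuid

def build_provision_event(instance_id, action, payload=None):
    # Illustrative event shape: the real QuestDB Cloud message format is not public.
    return json.dumps({
        "event_id": str(uuid.uuid4()),   # lets the tenant backend deduplicate
        "instance_id": instance_id,
        "action": action,                # e.g. "create", "resize", "upgrade", "delete"
        "payload": payload or {},
    })

event = build_provision_event("qdb-123", "resize", {"cpu": 8, "memory_gb": 32})
# A producer on the main cluster would then publish it, e.g. with kafka-python:
# KafkaProducer(bootstrap_servers="...").send("provisioning", event.encode())
```

Carrying a unique event id makes the tenant-side consumer idempotent, which matters because Kafka delivery is at-least-once.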
32. Upgrading the clusters
● Happens when
○ New version of control panel
○ New version of cloud backend
○ New version for dependencies
○ New QuestDB release
● Managed via
○ Argo CD listening to GitHub
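For reference, a hypothetical Argo CD Application pointing a tenant cluster at a Git path; the name, repo URL, path, and namespace below are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: tenant-backend            # placeholder name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/cloud-deploys   # placeholder repo
    path: tenant/backend
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: backend
  syncPolicy:
    automated:                    # sync on every new commit
      prune: true
      selfHeal: true
```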
33. Upgrading QuestDB
● Once a new Enterprise (or legacy OSS) release is available, Argo CD makes it
available from the control panel, both for new instances and as an optional
upgrade for existing ones
● Both instance creation and instance changes (including upgrades, changes on
instance or disk sizing, changes in running state, deletions, or configuration
changes) are done via the provisioning mechanism initiated with a Kafka message
34. Managed snapshots
● Manual snapshots are provisioned (using Kafka) and executed as a short-lived pod
on the customer instance. Storage taken by manual snapshots is billed separately
● Automatic scheduled snapshots are also provisioned using Kafka, and executed as
long-lived pods that wake up on schedule. Scheduled snapshots can be paused or
deleted, in which case the pods are paused or deleted too. The schedule is an
hourly range, and we can initiate at any point during the hour to limit concurrency.
The last 7 days of scheduled snapshots are included in the instance base price
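The "initiate at any point during the hour" trick can be sketched as a small jitter function; the function name and interface are illustrative:

```python
import random
from datetime import datetime, timedelta, timezone

def snapshot_start(window_hour, now=None):
    # Given the start of a customer's hourly window (e.g. 3 for 03:00-04:00 UTC),
    # pick a random second inside that hour so concurrent snapshots spread out
    # instead of all starting at the top of the hour.
    now = now or datetime.now(timezone.utc)
    window = now.replace(hour=window_hour, minute=0, second=0, microsecond=0)
    return window + timedelta(seconds=random.randrange(3600))

start = snapshot_start(3)
```

Spreading the start times keeps EBS and snapshot-API load flat instead of spiky across a fleet of instances that all picked the same window.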
35. Monitoring and Observability
QuestDB exposes Prometheus metrics; Vector.dev exports them.
We use Grafana for internal observability of every node and instance.
We ingest selected metrics into QuestDB to power customer dashboards (uPlot).
Logs are exported via Loki. We also exported them to S3 initially, but we are removing that.
The different K8s clusters cross-monitor each other.
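Because both the health check and the metrics live on port 9003, a scraper only needs HTTP plus a small reader for the Prometheus text format. A minimal sketch, where the metric names are illustrative rather than QuestDB's exact metric set:

```python
def parse_prometheus_text(text):
    # Minimal reader for the Prometheus text exposition format: skip comments
    # and blank lines, then split "name value" pairs (label-free lines only).
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

# Shape of what the metrics port (9003) returns; metric names are illustrative.
sample = """\
# HELP questdb_json_queries_total Number of JSON queries
# TYPE questdb_json_queries_total counter
questdb_json_queries_total 42
questdb_memory_tag_mmap 1048576
"""
metrics = parse_prometheus_text(sample)
```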
38. Operating the QuestDB cloud
● Almost everything is automated
● When there is an alert, it is sent via PagerDuty and Slack
○ On-call/team rotas
○ Playbooks for: provisioner, backend, frontend, ingress, backup,
HAProxy, PgBouncer, and QuestDB
● Admin tasks are still very much manual, with some templates for
querying Grafana and PostgreSQL for common tasks
39. The (fully remote) team
● 1.5 front-enders
● 1.5 backenders (the other 0.5 from above)
● 2 infrastructure/devops
40. The (fully remote) extended team
● Core team (CTO + 8 devs) working on some enterprise/cloud features and
adapting core when needed. Part of the on-call
● CEO, and COO for legal, pricing, and business matters
● Tech writer, for docs
● Developer Advocate, for developer experience, feedback, and demos
● Head of Talent, for putting the team together
41. Some upcoming features for cloud
● Compression (actually, just released last week)
● Quickstart tutorial when launching a new instance
● Cold storage, moving data automatically to S3
● Role-Based Access Control
● Horizontal scaling for reads (coming to QuestDB OSS as well)
● Horizontal scaling for writes
● Configurable alerting
● More single sign on choices
● VPC peering
● SOC2 compliance
● Adding new regions
● Azure support