(talk delivered at OSA CON 23)
Traditionally, databases have treated timestamps as just another data type. When performing real-time analytics, however, timestamps should be first-class citizens, and we need rich time semantics to get the most out of our data. We also need to deal with ever-growing datasets while staying performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed.
We will learn how it deals with data ingestion, and which SQL extensions it implements for working with time-series efficiently.
We will also review some of the changes we have made over the past two years to deal with late and out-of-order data, non-blocking writes, read replicas, and data deduplication.
7. Working with timestamped data in a
database is tricky*
* especially when analysing data that changes over time or arrives at a high rate
8. If you can use only one
database for everything, go
with PostgreSQL*
* Or any other major, well-supported RDBMS
9. Some things RDBMSs are not designed for
● Writing data faster than it is read (several million inserts per day, and faster)
● Aggregations scoped to different time units (per year/minute/microsecond)
● Identifying gaps or missing data for a given interval
● Joining tables by approximate timestamp
● Sparse data (tables with hundreds or thousands of columns)
● Aggregates over billions of records
11. ● a factory floor with 500 machines, or
● a fleet with 500 vehicles, or
● 50 trains, with 10 cars each, or
● 500 users with a mobile phone
Sending data every second
How I made my first billion
15. 43,200,000 rows a day
302,400,000 rows a week
1,314,144,000 rows a month
How I made my first billion
* See? With streaming data, it is kind of easy to get your first billion data points
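The figures on the previous slide can be checked with quick arithmetic, assuming the 500 devices from earlier each send one row per second (using an average month of 30.42 days):

```python
# 500 devices (machines, vehicles, phones...), one row per second each
devices = 500
rows_per_day = devices * 24 * 60 * 60        # 43,200,000
rows_per_week = rows_per_day * 7             # 302,400,000
rows_per_month = rows_per_day * 3042 // 100  # 30.42 avg days/month -> 1,314,144,000
print(rows_per_day, rows_per_week, rows_per_month)
```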
17. Time-series database basics
● Optimised for fast append-only ingestion
● Data lifecycle policies
● Analytics over chunks of time
● Time-based aggregations
● Often power real-time dashboards
19. QuestDB would like to be known for:
● Performance
○ Also with smaller machines
● Developer Experience
● Proudly Open Source (Apache 2.0)
21. QuestDB ingestion and storage layer
● Data always stored by incremental timestamp.
● Data partitioned by time units and stored in columnar format.
● No indexes needed. Data is immediately available after writing.
● Predictable ingestion rate, even under demanding workloads (millions/second).
● Built-in event deduplication.
● Optimized data types (symbol, geohash, IPv4, UUID).
● Row updates and upserts supported.
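As a sketch of how deduplication and time partitioning fit together, both are declared at table-creation time; the table and column names below are made up for illustration:

```sql
-- Hypothetical table: partitioned by day, with built-in deduplication
-- on (timestamp, device) so replayed events upsert instead of duplicating
CREATE TABLE readings (
  device SYMBOL,
  temperature DOUBLE,
  ts TIMESTAMP
) TIMESTAMP(ts) PARTITION BY DAY WAL
  DEDUP UPSERT KEYS(ts, device);
```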
22. Lifecycle policies
ALTER TABLE my_table DROP PARTITION LIST '2021-01-01', '2021-01-02';
--Delete days before 2021-01-03
ALTER TABLE my_table DROP PARTITION WHERE timestamp < to_timestamp('2021-01-03', 'yyyy-MM-dd');
ALTER TABLE x DETACH PARTITION LIST '2019-02-01', '2019-02-02';
-- It is also possible to use WHERE clause to define the partition list
ALTER TABLE sensors DETACH PARTITION WHERE timestamp < '2019-02-03T00';
CREATE TABLE my_table (i symbol, ts timestamp) IN VOLUME SECONDARY_VOLUME;
23. Connectivity, protocols, and interfaces
● REST API and web console: Query execution, CSV imports/exports. Basic charts.
● PGWire: perfect for querying, DDL, and DML. Ingestion supported, up to moderate
throughput. Compatible with any low-level PostgreSQL client or library.
● InfluxDB Line Protocol (ILP): socket-based, ingestion only, very high throughput. Official clients
available for C/C++, Java, Python, Rust, Go, Node.js, and .NET.
● Health/Metrics: HTTP endpoint with Prometheus format
● Integrations with: Apache Kafka, Apache Flink, Apache Spark, Python Pandas, Grafana,
Superset, Telegraf, Redpanda, qStudio, SQLAlchemy, Cube…
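For a sense of what ILP looks like on the wire: each line carries a table name, symbol (tag) columns, field columns, and a timestamp. A minimal sketch building one line by hand, with made-up trade values (in practice you would use one of the official clients, which also handle escaping and buffering):

```python
def ilp_line(table, symbols, columns, ts_nanos):
    """Build one InfluxDB Line Protocol line: table,tags fields timestamp."""
    tags = ",".join(f"{k}={v}" for k, v in symbols.items())
    fields = ",".join(
        f'{k}="{v}"' if isinstance(v, str) else f"{k}={v}"
        for k, v in columns.items()
    )
    return f"{table},{tags} {fields} {ts_nanos}"

line = ilp_line("trades",
                {"symbol": "BTC-USD", "side": "buy"},
                {"price": 39269.98, "amount": 0.05},
                1702348800000000000)
print(line)
```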
25. QuestDB Query engine internals
● Our Java codebase has zero dependencies. No garbage collection on
the hot path. As close to the hardware as possible.
● We research the latest trends. Our code takes advantage of the
state-of-the-art in CPU, storage design, and data structures.
● We implement our own just-in-time (JIT) compiler to make query execution
as parallel and fast as possible.
● We spend weeks of development to save microseconds or
nanoseconds in many operations.
27. LATEST ON … PARTITION BY …
Retrieves the latest entry by timestamp for a given key or combination of keys, for scenarios where multiple
time series are stored in the same table.
SELECT * FROM trades
LATEST ON timestamp PARTITION BY symbol;
Try it live on
https://demo.questdb.io
28. LATEST ON … PARTITION BY …
Retrieves the latest entry by timestamp for a given key or combination of keys, for scenarios where multiple
time series are stored in the same table.
SELECT * FROM trades
WHERE symbol in ('BTC-USD', 'ETH-USD')
LATEST ON timestamp PARTITION BY symbol, side;
Try it live on
https://demo.questdb.io
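The semantics of LATEST ON ... PARTITION BY can be sketched in plain Python: scan rows in timestamp order and keep the last row seen for each key. The rows here are made up for illustration:

```python
# Rows already ordered by the designated timestamp, as QuestDB stores them
rows = [
    {"ts": 1, "symbol": "BTC-USD", "price": 42000.0},
    {"ts": 2, "symbol": "ETH-USD", "price": 2200.0},
    {"ts": 3, "symbol": "BTC-USD", "price": 42100.0},
]

latest = {}
for row in rows:
    latest[row["symbol"]] = row  # later rows overwrite earlier ones per key

result = list(latest.values())   # one row per symbol: the most recent
```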
29. SAMPLE BY
Aggregates data in homogeneous time chunks
SELECT
timestamp,
sum(price * amount) / sum(amount) AS vwap_price,
sum(amount) AS volume
FROM trades
WHERE symbol = 'BTC-USD' AND timestamp > dateadd('d', -1, now())
SAMPLE BY 15m ALIGN TO CALENDAR;
SELECT timestamp, min(tempF),
max(tempF), avg(tempF)
FROM weather SAMPLE BY 1M;
Try it live on
https://demo.questdb.io
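What SAMPLE BY computes for the vwap_price column above can be sketched as bucketing rows into fixed time chunks and aggregating per bucket. The sample trades below are made up:

```python
from collections import defaultdict

trades = [  # (epoch_seconds, price, amount) -- hypothetical ticks
    (0,   100.0, 2.0),
    (300, 110.0, 1.0),
    (900, 105.0, 4.0),  # falls in the second 15-minute bucket
]
BUCKET = 15 * 60  # SAMPLE BY 15m

acc = defaultdict(lambda: [0.0, 0.0])  # bucket -> [sum(price*amount), sum(amount)]
for ts, price, amount in trades:
    b = ts // BUCKET * BUCKET          # bucket start, calendar-aligned
    acc[b][0] += price * amount
    acc[b][1] += amount

vwap = {b: pa / a for b, (pa, a) in acc.items()}
```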
30. How do you ask your database to
return which data is not stored?
31. I am sending data every second or
so. Tell me which devices didn’t
send any data with more than 1.5
seconds gap
32. SAMPLE BY … FILL
Can fill missing time chunks using different strategies (NULL, a constant, LINEAR interpolation, or the PREVious value)
SELECT
timestamp,
sum(price * amount) / sum(amount) AS vwap_price,
sum(amount) AS volume
FROM trades
WHERE symbol = 'BTC-USD' AND timestamp > dateadd('d', -1, now())
SAMPLE BY 1s FILL(NULL) ALIGN TO CALENDAR;
Try it live on
https://demo.questdb.io
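The fill strategies can be sketched over a sparse series of one-second buckets; the values below are made up:

```python
buckets = {0: 10.0, 1: None, 2: None, 3: 16.0}  # two missing seconds

# FILL(PREV): carry the previous observed value forward
prev_filled, last = {}, None
for t in sorted(buckets):
    if buckets[t] is not None:
        last = buckets[t]
    prev_filled[t] = last

# FILL(LINEAR): interpolate between the surrounding observed values (10 -> 16)
linear_filled = dict(buckets)
linear_filled[1] = 10.0 + (16.0 - 10.0) * 1 / 3
linear_filled[2] = 10.0 + (16.0 - 10.0) * 2 / 3
```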
33. WHERE … TIME RANGE
SELECT * from trips WHERE pickup_datetime in '2018';
SELECT * from trips WHERE pickup_datetime in '2018-06';
SELECT * from trips WHERE pickup_datetime in '2018-06-21T23:59';
Try it live on
https://demo.questdb.io
34. WHERE … TIME RANGE
SELECT * from trips WHERE pickup_datetime in '2018';
SELECT * from trips WHERE pickup_datetime in '2018-06';
SELECT * from trips WHERE pickup_datetime in '2018-06-21T23:59';
SELECT * from trips WHERE pickup_datetime in '2018;2M' LIMIT -10;
SELECT * from trips WHERE pickup_datetime in '2018;10s' LIMIT -10;
SELECT * from trips WHERE pickup_datetime in '2018;-3d' LIMIT -10;
Try it live on
https://demo.questdb.io
35. WHERE … TIME RANGE
SELECT * from trips WHERE pickup_datetime in '2018';
SELECT * from trips WHERE pickup_datetime in '2018-06';
SELECT * from trips WHERE pickup_datetime in '2018-06-21T23:59';
SELECT * from trips WHERE pickup_datetime in '2018;2M' LIMIT -10;
SELECT * from trips WHERE pickup_datetime in '2018;10s' LIMIT -10;
SELECT * from trips WHERE pickup_datetime in '2018;-3d' LIMIT -10;
SELECT * from trips WHERE pickup_datetime in '2018-06-21T23:59:58;4s;1d;7'
SELECT * from trips WHERE pickup_datetime in '2018-06-21T23:59:58;4s;-1d;7'
Try it live on
https://demo.questdb.io
36. What if I have two tables, where
data is (obviously) not sent at the
same exact timestamps and I want
to join by closest matching
timestamp?
37. ASOF JOIN (LT JOIN and SPLICE JOIN variations)
ASOF JOIN joins two different time series, even when they are measured at different timestamps. For each row in the first time series, ASOF JOIN takes from the second time series the row whose timestamp meets both of the following criteria:
● The timestamp is the closest to the first timestamp.
● The timestamp is prior to or equal to the first timestamp.
WITH trips2018 AS (
SELECT * from trips WHERE pickup_datetime in '2018'
)
SELECT pickup_datetime, timestamp, fare_amount, tempF, windDir
FROM trips2018
ASOF JOIN weather;
Try it live on
https://demo.questdb.io
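The ASOF JOIN rule above (the closest timestamp that is prior to or equal) can be sketched with a binary search over the second series; the timestamps and values below are made up:

```python
import bisect

trips   = [(10, 7.5), (25, 12.0)]   # (ts, fare_amount)
weather = [(0, 60), (20, 62), (30, 65)]  # (ts, tempF), sorted by ts
w_ts = [ts for ts, _ in weather]

joined = []
for ts, fare in trips:
    i = bisect.bisect_right(w_ts, ts) - 1  # last weather row with ts <= trip ts
    joined.append((ts, fare, weather[i][1] if i >= 0 else None))
```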
38. Some things we are trying out next for performance
● Compression, and exploring data formats like Arrow / Parquet
● Own ingestion protocol
● Second level partitioning
● Improved vectorization of some operations (group by multiple columns or by expressions)
● Specific join optimizations (index nested-loop joins, for example)
39. QuestDB OSS
Open Source. Self-managed. Suitable for
production workloads.
https://github.com/questdb/questdb
QuestDB Enterprise
Licensed. Self-managed. Enterprise features like
RBAC, compression, replication, TLS on all
protocols, cold storage, K8s operator…
https://questdb.io/enterprise/
QuestDB Cloud
Fully managed, pay-per-usage environment,
with enterprise-grade features.
https://questdb.io/cloud/
40. OSA CON | December 12-14 2023
Q&A
● github.com/questdb/questdb
● https://questdb.io
● https://demo.questdb.io
● https://github.com/javier/questdb-quickstart
● https://slack.questdb.io/
Javier Ramirez
@supercoco9
We 💕 contributions
and GitHub ⭐ stars