OSA CON | December 12-14 2023
The Building Blocks of a
Time-Series Database
Javier Ramirez
Database advocate
@supercoco9
Timestamp problems are hard
https://twitter.com/rshidashrf/status/975566803052134400
https://stackoverflow.com/questions/6841333/why-is-subtracting-these-two-epoch-milli-times-in-year-1927-giving-a-strange-r
1553983200
* This is a date. Which one?
When is this?
1/4/19
* Is this April 19th? January 4th? April 1st?
Working with timestamped data in a
database is tricky*
* especially when doing analytics on data that changes over time or arrives at a high rate
If you can use only one
database for everything, go
with PostgreSQL*
* Or any other major and well supported RDBMS
Some things an RDBMS is not designed for
● Writing data faster than it is read (several million inserts per day, or faster)
● Aggregations scoped to different time units (per year/minute/microsecond)
● Identifying gaps or missing data for a given interval
● Joining tables by approximate timestamp
● Sparse data (tables with hundreds or thousands of columns)
● Aggregates over billions of records
● a factory floor with 500 machines, or
● a fleet with 500 vehicles, or
● 50 trains, with 10 cars each, or
● 500 users with a mobile phone
Sending data every second
How I made my first billion
86,400
* Seconds in one day
604,800
* Seconds in one week
2,628,288
* Seconds in one month. Well, in the average month of 30.42 days anyway
43,200,000 rows a day…
302,400,000 rows a week…
1,314,144,000 rows a month
How I made my first billion
* See? With streaming data, it is kind of easy to get your first billion data points
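The arithmetic behind those figures, as a quick sanity check (500 sources reporting once per second; any SQL engine will evaluate this):
SELECT
  500 * 86400 AS rows_per_day, -- 43,200,000
  500 * 604800 AS rows_per_week, -- 302,400,000
  500 * 2628288 AS rows_per_month; -- 1,314,144,000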
Not all data
problems are
the same
Time-series database basics
● Optimised for fast append-only ingestion
● Data lifecycle policies
● Analytics over chunks of time
● Time-based aggregations
● Often power real-time dashboards
QuestDB would like to be known for:
● Performance
○ Also with smaller machines
● Developer Experience
● Proudly Open Source (Apache 2.0)
Fast streaming
ingestion
* You can try ingesting streaming data using https://github.com/javier/questdb-quickstart
QuestDB ingestion and storage layer
● Data always stored by incremental timestamp.
● Data partitioned by time units and stored in columnar format.
● No indexes needed. Data is immediately available after writing.
● Predictable ingestion rate, even under demanding workloads (millions/second).
● Built-in event deduplication.
● Optimized data types (symbol, geohash, IPv4, UUID).
● Row updates and upserts supported.
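An illustrative sketch of a table declaration exercising several of these features (hypothetical table and column names, not from the talk):
CREATE TABLE sensors (
  device_id SYMBOL, -- interned string: cheap to filter and group by
  tempF DOUBLE,
  ts TIMESTAMP
) TIMESTAMP(ts) -- designated timestamp: rows stored in ts order
PARTITION BY DAY WAL -- one partition per day of data
DEDUP UPSERT KEYS(ts, device_id); -- built-in event deduplication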
Lifecycle policies
ALTER TABLE my_table DROP PARTITION LIST '2021-01-01', '2021-01-02';
-- Delete days before 2021-01-03
ALTER TABLE my_table DROP PARTITION WHERE timestamp < to_timestamp('2021-01-03', 'yyyy-MM-dd');
ALTER TABLE x DETACH PARTITION LIST '2019-02-01', '2019-02-02';
-- It is also possible to use WHERE clause to define the partition list
ALTER TABLE sensors DETACH PARTITION WHERE timestamp < '2019-02-03T00';
CREATE TABLE my_table (i symbol, ts timestamp) IN VOLUME SECONDARY_VOLUME;
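A detached partition can later be brought back with ATTACH (a sketch; the detached files must first be made available to the table again):
ALTER TABLE sensors ATTACH PARTITION LIST '2019-02-01';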
Connectivity, protocols, and interfaces
● REST API and web console: Query execution, CSV imports/exports. Basic charts.
● Pgwire: perfect for querying, DDL, and DML. Ingestion supported, up to moderate
throughput. Compatible with any low-level PostgreSQL client or library.
● Influx Line Protocol (ILP): socket-based, ingestion only, very high throughput (sample line
after this list). Official clients available for C/C++, Java, Python, Rust, Go, Node.js, and .NET.
● Health/Metrics: HTTP endpoint with Prometheus format
● Integrations with: Apache Kafka, Apache Flink, Apache Spark, Python Pandas, Grafana,
Superset, Telegraf, Redpanda, qStudio, SQLAlchemy, Cube…
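For reference, each ILP message is a single line of text: table name, comma-separated symbol columns, a space, the remaining columns, a space, and a nanosecond timestamp. A sketch with hypothetical names:
sensors,device_id=dev1 tempF=68.5 1701388800000000000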
The query engine
QuestDB Query engine internals
● Our Java codebase has zero dependencies. No garbage collection on
the hot path. As close to the hardware as possible.
● We research the latest trends. Our code takes advantage of the
state-of-the-art in CPU, storage design, and data structures.
● We implement our own Just-In-Time (JIT) compiler to make query execution
as parallel and fast as possible.
● We spend weeks of development to save microseconds or
nanoseconds in many operations.
The query language: SQL with
time-series extensions
LATEST ON … PARTITION BY …
Retrieves the latest entry by timestamp for a given key or combination of keys, for scenarios where multiple
time series are stored in the same table.
SELECT * FROM trades
LATEST ON timestamp PARTITION BY symbol;
Try it live on
https://demo.questdb.io
LATEST ON … PARTITION BY …
Retrieves the latest entry by timestamp for a given key or combination of keys, for scenarios where multiple
time series are stored in the same table.
SELECT * FROM trades
WHERE symbol in ('BTC-USD', 'ETH-USD')
LATEST ON timestamp PARTITION BY symbol, side;
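-- With the two filtered symbols and two sides (buy/sell) in the demo trades
-- table, this returns at most four rows: the latest trade per (symbol, side).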
Try it live on
https://demo.questdb.io
SAMPLE BY
Aggregates data in homogeneous time chunks
SELECT
timestamp,
sum(price * amount) / sum(amount) AS vwap_price,
sum(amount) AS volume
FROM trades
WHERE symbol = 'BTC-USD' AND timestamp > dateadd('d', -1, now())
SAMPLE BY 15m ALIGN TO CALENDAR;
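-- ALIGN TO CALENDAR snaps the 15m buckets to clock boundaries (:00, :15,
-- :30, :45) instead of aligning them to the first observed timestamp.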
SELECT timestamp, min(tempF),
max(tempF), avg(tempF)
FROM weather SAMPLE BY 1M;
Try it live on
https://demo.questdb.io
How do you ask your database to
return which data is not stored?
I am sending data every second or
so. Tell me which devices had a gap
of more than 1.5 seconds between
data points.
SAMPLE BY … FILL
Can fill missing time chunks using different strategies (NULL, constant, LINEAR, PREVious value)
SELECT
timestamp,
sum(price * amount) / sum(amount) AS vwap_price,
sum(amount) AS volume
FROM trades
WHERE symbol = 'BTC-USD' AND timestamp > dateadd('d', -1, now())
SAMPLE BY 1s FILL(NULL) ALIGN TO CALENDAR;
Try it live on
https://demo.questdb.io
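One hedged way to answer the gap question above, at one-second resolution, assuming a hypothetical readings table with designated timestamp ts: sample per second and fill empty buckets with a zero count; every zero bucket is a second with no data from that device.
SELECT ts, count() AS readings
FROM readings
WHERE device_id = 'dev1' -- hypothetical table and column names
SAMPLE BY 1s FILL(0);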
WHERE … TIME RANGE
SELECT * from trips WHERE pickup_datetime in '2018';
SELECT * from trips WHERE pickup_datetime in '2018-06';
SELECT * from trips WHERE pickup_datetime in '2018-06-21T23:59';
Try it live on
https://demo.questdb.io
WHERE … TIME RANGE
SELECT * from trips WHERE pickup_datetime in '2018';
SELECT * from trips WHERE pickup_datetime in '2018-06';
SELECT * from trips WHERE pickup_datetime in '2018-06-21T23:59';
SELECT * from trips WHERE pickup_datetime in '2018;2M' LIMIT -10;
SELECT * from trips WHERE pickup_datetime in '2018;10s' LIMIT -10;
SELECT * from trips WHERE pickup_datetime in '2018;-3d' LIMIT -10;
Try it live on
https://demo.questdb.io
WHERE … TIME RANGE
SELECT * from trips WHERE pickup_datetime in '2018';
SELECT * from trips WHERE pickup_datetime in '2018-06';
SELECT * from trips WHERE pickup_datetime in '2018-06-21T23:59';
SELECT * from trips WHERE pickup_datetime in '2018;2M' LIMIT -10;
SELECT * from trips WHERE pickup_datetime in '2018;10s' LIMIT -10;
SELECT * from trips WHERE pickup_datetime in '2018;-3d' LIMIT -10;
SELECT * from trips WHERE pickup_datetime in '2018-06-21T23:59:58;4s;1d;7';
SELECT * from trips WHERE pickup_datetime in '2018-06-21T23:59:58;4s;-1d;7';
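-- Reading the four-part modifier (a hedged gloss of QuestDB interval syntax):
-- 'start;duration;period;count' — a 4s interval beginning at the timestamp,
-- repeated every 1 day (backwards in time with -1d), for 7 occurrences.
-- The negative LIMIT above returns the last 10 rows rather than the first 10.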
Try it live on
https://demo.questdb.io
What if I have two tables, where
data is (obviously) not sent at the
same exact timestamps and I want
to join by closest matching
timestamp?
ASOF JOIN (LT JOIN and SPLICE JOIN variations)
ASOF JOIN joins two different time series. For each row in the first time series, the ASOF JOIN takes from
the second time series the row whose timestamp meets both of the following criteria:
● The timestamp is the closest to the first timestamp.
● The timestamp is prior to or equal to the first timestamp.
WITH trips2016 AS (
SELECT * from trips WHERE pickup_datetime in '2016'
)
SELECT pickup_datetime, timestamp, fare_amount, tempF, windDir
FROM trips2016
ASOF JOIN weather;
Try it live on
https://demo.questdb.io
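The slide names LT JOIN as a variation; as a hedged sketch, it has the same shape as ASOF JOIN but matches the closest row strictly before each timestamp, never an equal one:
SELECT pickup_datetime, timestamp, fare_amount, tempF
FROM trips
LT JOIN weather;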
Some things we are trying out next for performance
● Compression, and exploring data formats like Arrow / Parquet
● Own ingestion protocol
● Second level partitioning
● Improved vectorization of some operations (group by multiple columns or by expressions)
● Specific join optimizations (index nested loop joins, for example)
QuestDB OSS
Open Source. Self-managed. Suitable for
production workloads.
https://github.com/questdb/questdb
QuestDB Enterprise
Licensed. Self-managed. Enterprise features like
RBAC, compression, replication, TLS on all
protocols, cold storage, K8s operator…
https://questdb.io/enterprise/
QuestDB Cloud
Fully managed, pay-per-usage environment,
with enterprise-grade features.
https://questdb.io/cloud/
OSA CON | December 12-14 2023
Q&A
● github.com/questdb/questdb
● https://questdb.io
● https://demo.questdb.io
● https://github.com/javier/questdb-quickstart
● https://slack.questdb.io/
Javier Ramirez
@supercoco9
We 💕 contributions
and GitHub ⭐ stars
