Your Timestamps Deserve Better than a Generic Database, by javier ramirez (QuestDB)

This document discusses the challenges of working with timestamped data in databases and introduces QuestDB, a time-series database designed to address them. It highlights QuestDB's high performance for ingesting and querying large volumes of timestamped data, demonstrates several time-series query patterns (time-range queries, sampling, filling missing data, retrieving the latest value, and approximate joins between tables), and outlines some areas QuestDB is exploring to further improve performance.
2. About me: I like databases & open source
2022 - today. Developer relations at an open source database vendor
● QuestDB, PostgreSQL, MongoDB, Timescale, InfluxDB, Apache Flink
2019 - 2022. Data & Analytics specialist at a cloud provider
● Amazon Aurora, Neptune, Athena, Timestream, DynamoDB, DocumentDB, Kinesis Data Streams, Kinesis Data Analytics, Redshift, ElastiCache for Redis, QLDB, ElasticSearch, OpenSearch, Cassandra, Spark…
2013 - 2018. Data Engineer / Big Data & Analytics consultant
● PostgreSQL, Redis, Neo4j, Google BigQuery, BigTable, Google Cloud Spanner, Apache Spark, Apache BEAM, Apache Flink, HBase, MongoDB, Presto
2006 - 2012. Web developer
● MySQL, Redis, PostgreSQL, SQLite, ElasticSearch
Late nineties to 2005. Desktop/CGI/Servlets/EJBs/CORBA
● MS Access, MySQL, Oracle, Sybase, Informix
As a student/hobbyist (late eighties - early nineties)
● Amsbase, dBase III, dBase IV, FoxPro, Microsoft Works, Informix
(Timeline slide: the database eras)
● The pre-SQL years
● The licensed SQL period
● The libre and open SQL revolution / The NoSQL rise
● The Hadoop dark ages / The Python hegemony / The cloud database big migrations
● The streaming era / The database-as-a-service singularity
● The SQL revenge / the realtime database / the embedded database
10. Working with timestamped data in a database is tricky*
* especially when doing analytics on data that changes over time or arrives at a high rate
11. If you can use only one database for everything, go with PostgreSQL*
* Or any other major, well-supported RDBMS
12. Some things RDBMS are not designed for
● Writing data faster than it is read (several million inserts per day, and faster)
● Querying the rate of ingestion (how many records per second?) and identifying gaps
● Joining tables by approximate timestamp
● Aggregates over billions of records
● Sparse data (tables with hundreds or thousands of columns)
18. Now imagine
● a factory floor with 500 machines, or
● a fleet with 500 vehicles, or
● 50 trains with 10 cars each, or
● 500 users with a mobile phone
…each sending data every second
19. 500 sources × 86,400 seconds = 43,200,000 rows a day…
302,400,000 rows a week…
1,314,144,000 rows a month
23. We would like to be known for:
● Performance
○ Better performance with smaller machines
● Developer Experience
● Proudly Open Source (Apache 2.0)
24. Do you have a time-series problem? Write patterns
● You mostly insert data. You rarely update or delete individual rows
● Since data keeps growing, you will very likely end up with much more data than your typical operational database would be happy with
● Both ingestion and querying speed are critical for your business
25. Do you have a time-series problem? Read patterns
● Most of your queries are scoped to a time range
● You typically access recent/fresh data rather than older data
● Maybe you eventually just delete data because it is too old
● You often need to resample your data for aggregations/analytics
● You often need to align timestamps from multiple data series
● Your data powers near real-time dashboards
27. All benchmarks are lies (but they give us a ballpark)
Ingesting over 1.4 million rows per second (using 5 CPU threads)
https://questdb.io/blog/2021/05/10/questdb-release-6-0-tsbs-benchmark/
While running queries scanning over 4 billion rows per second (16 CPU threads)
https://questdb.io/blog/2022/05/26/query-benchmark-questdb-versus-clickhouse-timescale/
Time-series specialised benchmark
https://github.com/questdb/tsbs
34. Try out query performance on open datasets
https://demo.questdb.io/
Using SQL and the Postgres Wire Protocol
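For example, this sketch (using the public trips demo table that also appears later in this deck) can be pasted straight into the console; SAMPLE BY is covered in more detail a few slides further on:

-- Daily average fare across one month of taxi data.
SELECT pickup_datetime, avg(fare_amount) AS avg_fare
FROM trips
WHERE pickup_datetime IN '2018-06'
SAMPLE BY 1d;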
35. Performance is only part of the question. Usability and time-series semantics are critical
38. WHERE … TIME RANGE
SELECT * FROM trips WHERE pickup_datetime IN '2018';
SELECT * FROM trips WHERE pickup_datetime IN '2018-06';
SELECT * FROM trips WHERE pickup_datetime IN '2018-06-21T23:59';
SELECT * FROM trips WHERE pickup_datetime IN '2018;2M' LIMIT -10;
SELECT * FROM trips WHERE pickup_datetime IN '2018;10s' LIMIT -10;
SELECT * FROM trips WHERE pickup_datetime IN '2018;-3d' LIMIT -10;
SELECT * FROM trips WHERE pickup_datetime IN '2018-06-21T23:59:58;4s;1d;7';
SELECT * FROM trips WHERE pickup_datetime IN '2018-06-21T23:59:58;4s;-1d;7';
Try it live on https://demo.questdb.io
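A rough guide to the shorthand above, as I read QuestDB's interval notation (worth checking against the docs): '2018;2M' is the whole of 2018 extended by two months, '2018;-3d' extends the interval backwards by three days, the four-part form '2018-06-21T23:59:58;4s;1d;7' is a 4-second window starting at that instant and repeated daily seven times, and a negative LIMIT returns the last rows rather than the first.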
39. SAMPLE BY
Aggregates data into homogeneous time chunks. Note the units: 15m is fifteen minutes, 1M is one month. ALIGN TO CALENDAR snaps the buckets to calendar boundaries rather than to the first observed timestamp.
SELECT
timestamp,
sum(price * amount) / sum(amount) AS vwap_price,
sum(amount) AS volume
FROM trades
WHERE symbol = 'BTC-USD' AND timestamp > dateadd('d', -1, now())
SAMPLE BY 15m ALIGN TO CALENDAR;

SELECT timestamp, min(tempF), max(tempF), avg(tempF)
FROM weather
SAMPLE BY 1M;
Try it live on https://demo.questdb.io
40. How do you ask your database which data is not stored?
41. I am sending data every second or so. Tell me which devices went more than 1.5 seconds without sending any data.
42. SAMPLE BY … FILL
Can fill missing time chunks using different strategies: NULL, a constant, LINEAR interpolation, or PREV (the previous value).
SELECT
timestamp,
sum(price * amount) / sum(amount) AS vwap_price,
sum(amount) AS volume
FROM trades
WHERE symbol = 'BTC-USD' AND timestamp > dateadd('d', -1, now())
SAMPLE BY 1s FILL(NULL) ALIGN TO CALENDAR;
Try it live on https://demo.questdb.io
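Putting slides 41 and 42 together, a minimal sketch of gap detection for a single device (the sensor_data table, device_id column, and 'sensor-42' value are hypothetical stand-ins, not the demo schema):

-- Bucket the last hour into 1-second chunks; FILL(0) materialises empty
-- seconds, so any bucket with readings = 0 is a gap of at least a second,
-- and gaps longer than 1.5 seconds show up as runs of consecutive zero buckets.
SELECT timestamp, count() AS readings
FROM sensor_data
WHERE device_id = 'sensor-42' AND timestamp > dateadd('h', -1, now())
SAMPLE BY 1s FILL(0);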
43. LATEST ON … PARTITION BY …
Retrieves the latest entry by timestamp for a given key or combination of keys, for scenarios where multiple time series are stored in the same table.
SELECT * FROM trades
WHERE symbol IN ('BTC-USD', 'ETH-USD')
LATEST ON timestamp PARTITION BY symbol;

44. The same, partitioned by a combination of keys:
SELECT * FROM trades
WHERE symbol IN ('BTC-USD', 'ETH-USD')
LATEST ON timestamp PARTITION BY symbol, side;
Try it live on https://demo.questdb.io
45. What if I have two tables where data is (obviously) not sent at the same exact timestamps, and I want to join by the closest matching timestamp?
46. ASOF JOIN (LT JOIN and SPLICE JOIN variations)
ASOF JOIN joins two different time series. For each row in the first time series, it takes from the second time series the row whose timestamp meets both of the following criteria:
● The timestamp is the closest to the first timestamp.
● The timestamp is prior to or equal to the first timestamp.
WITH trips2018 AS (
SELECT * FROM trips WHERE pickup_datetime IN '2018'
)
SELECT pickup_datetime, fare_amount, tempF, windDir
FROM trips2018
ASOF JOIN weather;
Try it live on https://demo.questdb.io
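As I understand the two variations (worth verifying against the QuestDB docs): LT JOIN is the strict flavour, matching only timestamps strictly earlier than the current row, and SPLICE JOIN keeps unmatched rows from both tables. A minimal LT JOIN sketch on the same demo tables:

-- Each trip paired with the latest weather reading strictly before pickup.
SELECT pickup_datetime, fare_amount, tempF
FROM trips
LT JOIN weather;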
52. Some things we are trying out next for performance
● Compression, and exploring data formats like Arrow / Parquet
● Our own ingestion protocol
● Embedding Julia in the database for custom code/UDFs
● Moving some parts to Rust
● Second-level partitioning
● Improved vectorization of some operations (GROUP BY on multiple columns or on expressions)
● Adding specific join optimizations (index nested-loop joins, for example)