Your database cannot do this (well)
Javier Ramirez (@supercoco9)
Developer Relations Lead at QuestDB
About me: I like databases & open source
2022- Developer Relations Lead at QuestDB
● QuestDB, PostgreSQL, MongoDB, Timescale, InfluxDB, Apache Flink
2019-2022 Developer Advocate at AWS, specialized in Data & Analytics
● Amazon Aurora, Neptune, Athena, Timestream, DynamoDB, DocumentDB, Kinesis Data Streams, Kinesis Data Analytics, Redshift,
ElastiCache for Redis, QLDB, ElasticSearch, OpenSearch, Cassandra, Spark…
2013-2018 Co-founder/Data Engineer at Teowaki. Google Developer Expert
● PostgreSQL, Redis, Neo4j, Google BigQuery, BigTable, Google Cloud Spanner, Apache Spark, Apache BEAM, Apache Flink, HBase, MongoDB,
Presto
2006-2012 CTO ASPgems
● MySQL, Redis, PostgreSQL, Sqlite, ElasticSearch
Previous jobs (late nineties to 2005):
● MS Access, MySQL, Oracle, Sybase, Informix
As a student/hobbyist:
● Amsbase, DBase III, DBase IV, Foxpro, Microsoft Works
If you can use only one
database for everything,
go with PostgreSQL*
* Or any other major and well supported RDBMS
Some things where RDBMS are really good
● Finding individual records (by key or by complex conditions)
● Finding multiple records from a single table
● Joining multiple tables, with integrity guarantees
● Strong transactions
● Powerful aggregations, including window/analytical functions
● Compose complex statements using CTEs, subqueries, and/or correlated queries
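To make the last two points concrete, here is a minimal sketch using Python's built-in sqlite3 module (any major RDBMS works the same way): a CTE feeding a window function that computes a running total per group. The `sales` table and its rows are made up for illustration.

```python
import sqlite3

# In-memory database with a hypothetical sales table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("eu", 10), ("eu", 30), ("us", 20), ("us", 40)])

# A CTE feeding a window function: running total per region
rows = conn.execute("""
    WITH ordered AS (
        SELECT region, amount FROM sales
    )
    SELECT region, amount,
           SUM(amount) OVER (PARTITION BY region ORDER BY amount) AS running
    FROM ordered
    ORDER BY region, amount
""").fetchall()

for row in rows:
    print(row)
# → ('eu', 10, 10), ('eu', 30, 40), ('us', 20, 20), ('us', 40, 60)
```

The `PARTITION BY` restarts the running sum for each region without collapsing the rows, which a plain `GROUP BY` cannot do.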
How RDBMS work
● Key-based. Not designed for duplicates or keyless data
● Integrity and transaction guarantees can slow things down
● Data is stored row by row. Whole rows are read even if only a few columns are used
● Indexes are needed for performance, with storage and write-performance implications
● Strict typing and a rigid schema model
Some things RDBMS are not designed for
● Highly nested schemas
● Sparse data (tables with hundreds or thousands of columns)
● Asking not only which rows are connected, but how
● Identifying missing connections
● Writing data faster than it is read (several million inserts per day or more)
● Querying rate of ingestion (how many records per second?) and identifying gaps
● Joining tables by approximate timestamp
● Aggregates over billions of records
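"Joining tables by approximate timestamp" is the ASOF join common in time-series work: each row pairs with the most recent row at or before its timestamp in the other table. A minimal pure-Python sketch, with made-up trade/quote data:

```python
import bisect

# Hypothetical data: quotes and trades keyed by epoch-second timestamps
quotes = [(100, 1.10), (105, 1.12), (112, 1.11)]   # (ts, price)
trades = [(101, 7), (106, 3), (120, 5)]            # (ts, size)

def asof_join(trades, quotes):
    """For each trade, find the latest quote at or before its timestamp."""
    quote_ts = [ts for ts, _ in quotes]
    joined = []
    for ts, size in trades:
        i = bisect.bisect_right(quote_ts, ts) - 1   # rightmost quote <= ts
        price = quotes[i][1] if i >= 0 else None
        joined.append((ts, size, price))
    return joined

print(asof_join(trades, quotes))
# trade at 101 pairs with the quote at 100, 106 with 105, 120 with 112
```

In standard SQL this takes a correlated subquery or window tricks over every row; time-series databases expose it as a first-class join.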
Not all data problems
are the same
Document Databases:
Document databases
● A single document can contain nested and repeated data. No joins necessary
● Query data via their own API rather than SQL, which can feel more natural when using
object-oriented languages
● Flexible schema, so different documents can have some common fields, and some
variable fields
● Validation/constraints can be added to the document schema
MongoDB quick demo
db.runCommand({
  "collMod": "peaks",
  "validator": {
    $jsonSchema: {
      "bsonType": "object",
      "description": "Document describing a mountain peak",
      "required": ["name", "height", "location"],
      "properties": {
        "name": {
          "bsonType": "string",
          "description": "Name must be a string and is required"
        },
        "height": {
          "bsonType": "number",
          "description": "Height must be a number between 100 and 10000 and is required",
          "minimum": 100,
          "maximum": 10000
        },
        "location": {
          "bsonType": "array",
          "description": "Location must be an array of strings",
          "minItems": 1,
          "uniqueItems": true,
          "items": {
            "bsonType": "string"
          }
        }
      }
    }
  }
})
https://www.digitalocean.com/community/tutorials/how-to-use-schema-validation-in-mongodb
db.createCollection( "orders", {
  validator: {
    $expr: {
      $eq: [
        "$totalWithVAT",
        { $multiply: [ "$total", { $sum: [ 1, "$VAT" ] } ] }
      ]
    }
  }
} )

// Fails validation: 141 × (1 + 0.20) is 169.2, not 169
db.orders.insertOne( {
  total: NumberDecimal("141"),
  VAT: NumberDecimal("0.20"),
  totalWithVAT: NumberDecimal("169")
} )

// Passes validation
db.orders.insertOne( {
  total: NumberDecimal("141"),
  VAT: NumberDecimal("0.20"),
  totalWithVAT: NumberDecimal("169.2")
} )
https://www.mongodb.com/docs/manual/core/schema-validation/specify-query-expression-rules/#std-label-schema-validation-query-expression
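The arithmetic the validator enforces is totalWithVAT = total × (1 + VAT). A quick check in Python, using Decimal to mirror MongoDB's NumberDecimal:

```python
from decimal import Decimal

total = Decimal("141")
vat = Decimal("0.20")

# The rule the $expr validator encodes: total × (1 + VAT)
expected = total * (1 + vat)

print(expected)
# → 169.20, so an order claiming totalWithVAT of 169 is rejected
```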
Graph Databases:
Graph data modelling
Data model:
● Vertices
● Edges
● Properties
● Labels
Languages:
● Gremlin: imperative traversal language created by Apache TinkerPop
● openCypher: declarative language created by Neo4j
Some use cases for graphs
● Social networks
● Recommendations
● Knowledge Graphs
● Fraud detection
● Life sciences
● Networking and security
● Content management and workflows
Graph databases
● Navigate deep hierarchies
● Discover inter-relationships between items
● Find hidden connections between distant items
● Detect relationship patterns: shortest path, hub detection, centrality (PageRank and
others), clustering coefficient, triadic closure…
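Shortest path, the first pattern above, is just breadth-first search on an unweighted graph. A stdlib-only sketch over a hypothetical FRIEND adjacency list (the names echo the slides' example, but the edges are made up):

```python
from collections import deque

# Hypothetical FRIEND graph as an adjacency list
graph = {
    "Javier": ["Kermit", "Piggy"],
    "Kermit": ["Javier", "Gonzo", "Draco"],
    "Piggy":  ["Javier", "Gonzo", "Fozzy"],
    "Gonzo":  ["Kermit", "Piggy", "Animal"],
    "Draco":  ["Kermit"],
    "Fozzy":  ["Piggy"],
    "Animal": ["Gonzo"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search: the first time we reach goal is a shortest path."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(shortest_path(graph, "Javier", "Animal"))
# → ['Javier', 'Kermit', 'Gonzo', 'Animal']
```

A graph database runs this kind of traversal natively over its edge structure instead of repeatedly self-joining a relational table.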
[Diagram: Javier connected to Kermit and Piggy by FRIEND edges]
g = graph.traversal()
g.V().has('name','Javier').as('user').
  both('FRIEND').aggregate('friends').
  both('FRIEND').
  where(neq('user')).where(neq('friends'))
[Diagram: larger FRIEND graph adding Draco, Gonzo, Fozzy, and Animal as friends of friends of Javier]
g = graph.traversal()
g.V().has('name','Javier').as('user').
  both('FRIEND').aggregate('friends').
  both('FRIEND').
  where(neq('user')).where(neq('friends')).
  groupCount().by('name').
  order(local).by(values, decr)
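The traversal above is a classic friend-of-friend recommendation: count everyone two FRIEND hops away, excluding the user and their direct friends, ranked by how many paths reach them. The same logic in plain Python, over a hypothetical adjacency list:

```python
from collections import Counter

# Hypothetical FRIEND graph mirroring the slides' example
friends = {
    "Javier": {"Kermit", "Piggy"},
    "Kermit": {"Javier", "Gonzo", "Draco"},
    "Piggy":  {"Javier", "Gonzo", "Fozzy"},
    "Gonzo":  {"Kermit", "Piggy", "Animal"},
    "Draco":  {"Kermit"},
    "Fozzy":  {"Piggy"},
    "Animal": {"Gonzo"},
}

def recommend(user):
    """Count friends-of-friends, excluding the user and their direct friends."""
    direct = friends[user]
    counts = Counter()
    for friend in direct:
        for fof in friends[friend]:
            if fof != user and fof not in direct:
                counts[fof] += 1
    return counts.most_common()

print(recommend("Javier"))
# Gonzo is reachable through both Kermit and Piggy, so it ranks first
```

The point of the Gremlin version is that the graph engine does this over billions of edges without materializing the whole neighbourhood in application memory.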
Apache TinkerPop demo
Time-Series Databases:
Do you have a time-series problem? Write patterns
● You mostly insert data. You rarely update or delete individual rows
● It is likely you write data more frequently than you read data
● Since data keeps growing, you will very likely end up with far more data
than a typical operational database is comfortable with
● Your data origin might experience bursts or lag, but keeping the correct
order of events is important to you
● Both ingestion and querying speed are critical for your business
Do you have a time-series problem? Read patterns
● Most of your queries are scoped to a time range
● You typically access recent/fresh data rather than older data
● But you still want to keep older data around for occasional analytics
● Maybe you eventually just delete data because it is too old
● You often need to resample your data for aggregations/analytics
● You often need to align timestamps from multiple data series
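Resampling means grouping points into fixed time buckets and aggregating each bucket (what QuestDB exposes as SAMPLE BY). A stdlib-only sketch with made-up sensor readings:

```python
from collections import defaultdict

# Hypothetical sensor readings: (epoch_seconds, value)
readings = [(0, 1.0), (10, 2.0), (35, 3.0), (61, 4.0), (65, 6.0)]

def resample_avg(readings, bucket_seconds):
    """Average values per fixed-size time bucket, keyed by bucket start."""
    buckets = defaultdict(list)
    for ts, value in readings:
        buckets[ts - ts % bucket_seconds].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}

print(resample_avg(readings, 60))
# → {0: 2.0, 60: 5.0}
```

A time-series database does this bucketing in one pass over timestamp-ordered storage; in a generic RDBMS you would reconstruct the bucket key in SQL on every query.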
We would like to be known for:
● Performance
○ Better performance with smaller machines
● Developer Experience
● Proudly Open Source (Apache 2.0)
All benchmarks are lies
Time-series specialised benchmark
https://github.com/timescale/tsbs
Ingestion benchmark results: 4.3 million records per second
https://github.com/questdb/questdb/releases/tag/7.0.1
While running queries scanning over 4 billion rows per second (16 CPU threads)
https://questdb.io/blog/2022/05/26/query-benchmark-questdb-versus-clickhouse-timescale/
QuestDB demo
https://demo.questdb.io/
https://github.com/questdb/questdb
https://questdb.io/cloud/
More info
https://questdb.io
https://demo.questdb.io
https://github.com/javier/questdb-quickstart
We 💕 contributions and ⭐ stars
github.com/questdb/questdb
THANKS!
Javier Ramirez, Head of Developer Relations at QuestDB, @supercoco9
