Your database cannot do this (well)
Javier Ramirez (@supercoco9)
Developer Relations Lead at QuestDB
About me: I like databases & open source
2022- Developer Relations Lead at QuestDB
● QuestDB, PostgreSQL, MongoDB, Timescale, InfluxDB, Apache Flink
2019-2022 Developer Advocate at AWS, specialized in Data & Analytics
● Amazon Aurora, Neptune, Athena, Timestream, DynamoDB, DocumentDB, Kinesis Data Streams, Kinesis Data Analytics, Redshift,
ElastiCache for Redis, QLDB, ElasticSearch, OpenSearch, Cassandra, Spark…
2013-2018 Co-founder/Data Engineer at Teowaki. Google Developer Expert
● PostgreSQL, Redis, Neo4j, Google BigQuery, BigTable, Google Cloud Spanner, Apache Spark, Apache BEAM, Apache Flink, HBase, MongoDB,
Presto
2006-2012 CTO ASPgems
● MySQL, Redis, PostgreSQL, Sqlite, ElasticSearch
Previous jobs (late nineties to 2005):
● MS Access, MySQL, Oracle, Sybase, Informix
As a student/hobbyist:
● Amsbase, DBase III, DBase IV, Foxpro, Microsoft Works
If you can use only one
database for everything,
go with PostgreSQL*
* Or any other major and well supported RDBMS
Some things where RDBMS are really good
● Finding individual records (by key or by complex conditions)
● Finding multiple records from a single table
● Joining multiple tables, with integrity guarantees
● Strong transactions
● Powerful aggregations, including window/analytical functions
● Compose complex statements using CTEs, subqueries, and/or correlated queries
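To make the last two points concrete, here is a minimal sketch using Python's built-in sqlite3 module (any major RDBMS works the same way): a CTE feeding a window function that computes a running total per group. The `sales` table and its rows are made up for illustration.

```python
import sqlite3

# In-memory database with a hypothetical sales table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("eu", 10), ("eu", 30), ("us", 20), ("us", 40)])

# A CTE feeding a window function: running total per region
rows = conn.execute("""
    WITH ordered AS (
        SELECT region, amount FROM sales
    )
    SELECT region, amount,
           SUM(amount) OVER (PARTITION BY region ORDER BY amount) AS running
    FROM ordered
    ORDER BY region, amount
""").fetchall()

for row in rows:
    print(row)
# → ('eu', 10, 10), ('eu', 30, 40), ('us', 20, 20), ('us', 40, 60)
```

The `PARTITION BY` restarts the running sum for each region without collapsing the rows, which a plain `GROUP BY` cannot do.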
How RDBMS work
● Key-based. Not designed for duplicates or keyless data
● Integrity and transaction guarantees can slow things down
● Data is stored row by row. Whole rows are read even if only a few columns are used
● Indexes are needed for performance, with storage and write-performance implications
● Strict typing and a rigid schema model
Some things RDBMS are not designed for
● Highly nested schemas
● Sparse data (tables with hundreds or thousands of columns)
● Asking not only which rows are connected, but how
● Identifying missing connections
● Writing data faster than it is read (several million inserts per day or more)
● Querying rate of ingestion (how many records per second?) and identifying gaps
● Joining tables by approximate timestamp
● Aggregates over billions of records
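"Joining tables by approximate timestamp" is the ASOF join common in time-series work: each row pairs with the most recent row at or before its timestamp in the other table. A minimal pure-Python sketch, with made-up trade/quote data:

```python
import bisect

# Hypothetical data: quotes and trades keyed by epoch-second timestamps
quotes = [(100, 1.10), (105, 1.12), (112, 1.11)]   # (ts, price)
trades = [(101, 7), (106, 3), (120, 5)]            # (ts, size)

def asof_join(trades, quotes):
    """For each trade, find the latest quote at or before its timestamp."""
    quote_ts = [ts for ts, _ in quotes]
    joined = []
    for ts, size in trades:
        i = bisect.bisect_right(quote_ts, ts) - 1   # rightmost quote <= ts
        price = quotes[i][1] if i >= 0 else None
        joined.append((ts, size, price))
    return joined

print(asof_join(trades, quotes))
# trade at 101 pairs with the quote at 100, 106 with 105, 120 with 112
```

In standard SQL this takes a correlated subquery or window tricks over every row; time-series databases expose it as a first-class join.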
Not all data problems
are the same
Document Databases:
Document databases
● A single document can contain nested and repeated data. No joins necessary
● Query data via their own API rather than SQL, which can feel more natural when using
object-oriented languages
● Flexible schema, so different documents can have some common fields, and some
variable fields
● Validation/constraints can be added to the document schema
MongoDB quick demo
db.runCommand({
  "collMod": "peaks",
  "validator": {
    $jsonSchema: {
      "bsonType": "object",
      "description": "Document describing a mountain peak",
      "required": ["name", "height", "location"],
      "properties": {
        "name": {
          "bsonType": "string",
          "description": "Name must be a string and is required"
        },
        "height": {
          "bsonType": "number",
          "description": "Height must be a number between 100 and 10000 and is required",
          "minimum": 100,
          "maximum": 10000
        },
        "location": {
          "bsonType": "array",
          "description": "Location must be an array of strings",
          "minItems": 1,
          "uniqueItems": true,
          "items": {
            "bsonType": "string"
          }
        }
      }
    }
  }
})
https://www.digitalocean.com/community/tutorials/how-to-use-schema-validation-in-mongodb
db.createCollection( "orders", {
  validator: {
    $expr: {
      $eq: [
        "$totalWithVAT",
        { $multiply: [ "$total", { $sum: [ 1, "$VAT" ] } ] }
      ]
    }
  }
} )

// Fails validation: 141 × (1 + 0.20) is 169.2, not 169
db.orders.insertOne( {
  total: NumberDecimal("141"),
  VAT: NumberDecimal("0.20"),
  totalWithVAT: NumberDecimal("169")
} )

// Passes validation
db.orders.insertOne( {
  total: NumberDecimal("141"),
  VAT: NumberDecimal("0.20"),
  totalWithVAT: NumberDecimal("169.2")
} )
https://www.mongodb.com/docs/manual/core/schema-validation/specify-query-expression-rules/#std-label-schema-validation-query-expression
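The arithmetic the validator enforces is totalWithVAT = total × (1 + VAT). A quick check in Python, using Decimal to mirror MongoDB's NumberDecimal:

```python
from decimal import Decimal

total = Decimal("141")
vat = Decimal("0.20")

# The rule the $expr validator encodes: total × (1 + VAT)
expected = total * (1 + vat)

print(expected)
# → 169.20, so an order claiming totalWithVAT of 169 is rejected
```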
Graph Databases:
Graph data modelling
Data model:
● Vertices
● Edges
● Properties
● Labels
Languages:
● Gremlin: imperative traversal language created by Apache TinkerPop
● openCypher: declarative language created by Neo4j
Some use cases for graphs
● Social networks
● Recommendations
● Knowledge Graphs
● Fraud detection
● Life sciences
● Networking and security
● Content management and workflows
Graph databases
● Navigate deep hierarchies
● Discover inter-relationships between items
● Find hidden connections between distant items
● Detect relationship patterns: shortest path, hub detection, centrality (PageRank and
others), clustering coefficient, triadic closure…
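Shortest path, the first pattern above, is just breadth-first search on an unweighted graph. A stdlib-only sketch over a hypothetical FRIEND adjacency list (the names echo the slides' example, but the edges are made up):

```python
from collections import deque

# Hypothetical FRIEND graph as an adjacency list
graph = {
    "Javier": ["Kermit", "Piggy"],
    "Kermit": ["Javier", "Gonzo", "Draco"],
    "Piggy":  ["Javier", "Gonzo", "Fozzy"],
    "Gonzo":  ["Kermit", "Piggy", "Animal"],
    "Draco":  ["Kermit"],
    "Fozzy":  ["Piggy"],
    "Animal": ["Gonzo"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search: the first time we reach goal is a shortest path."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(shortest_path(graph, "Javier", "Animal"))
# → ['Javier', 'Kermit', 'Gonzo', 'Animal']
```

A graph database runs this kind of traversal natively over its edge structure instead of repeatedly self-joining a relational table.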
[Diagram: Javier connected to Kermit and Piggy by FRIEND edges]
g = graph.traversal()
g.V().has('name','Javier').as('user').
  both('FRIEND').aggregate('friends').
  both('FRIEND').
  where(neq('user')).where(neq('friends'))
[Diagram: larger FRIEND graph adding Draco, Gonzo, Fozzy, and Animal as friends of friends of Javier]
g = graph.traversal()
g.V().has('name','Javier').as('user').
  both('FRIEND').aggregate('friends').
  both('FRIEND').
  where(neq('user')).where(neq('friends')).
  groupCount().by('name').
  order(local).by(values, decr)
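The traversal above is a classic friend-of-friend recommendation: count everyone two FRIEND hops away, excluding the user and their direct friends, ranked by how many paths reach them. The same logic in plain Python, over a hypothetical adjacency list:

```python
from collections import Counter

# Hypothetical FRIEND graph mirroring the slides' example
friends = {
    "Javier": {"Kermit", "Piggy"},
    "Kermit": {"Javier", "Gonzo", "Draco"},
    "Piggy":  {"Javier", "Gonzo", "Fozzy"},
    "Gonzo":  {"Kermit", "Piggy", "Animal"},
    "Draco":  {"Kermit"},
    "Fozzy":  {"Piggy"},
    "Animal": {"Gonzo"},
}

def recommend(user):
    """Count friends-of-friends, excluding the user and their direct friends."""
    direct = friends[user]
    counts = Counter()
    for friend in direct:
        for fof in friends[friend]:
            if fof != user and fof not in direct:
                counts[fof] += 1
    return counts.most_common()

print(recommend("Javier"))
# Gonzo is reachable through both Kermit and Piggy, so it ranks first
```

The point of the Gremlin version is that the graph engine does this over billions of edges without materializing the whole neighbourhood in application memory.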
Apache TinkerPop demo
Time-Series Databases:
Do you have a time-series problem? Write patterns
● You mostly insert data. You rarely update or delete individual rows
● It is likely you write data more frequently than you read data
● Since data keeps growing, you will very likely end up with far more data
than a typical operational database is comfortable with
● Your data origin might experience bursts or lag, but keeping the correct
order of events is important to you
● Both ingestion and querying speed are critical for your business
Do you have a time-series problem? Read patterns
● Most of your queries are scoped to a time range
● You typically access recent/fresh data rather than older data
● But you still want to keep older data around for occasional analytics
● Maybe you eventually just delete data because it is too old
● You often need to resample your data for aggregations/analytics
● You often need to align timestamps from multiple data series
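Resampling means grouping points into fixed time buckets and aggregating each bucket (what QuestDB exposes as SAMPLE BY). A stdlib-only sketch with made-up sensor readings:

```python
from collections import defaultdict

# Hypothetical sensor readings: (epoch_seconds, value)
readings = [(0, 1.0), (10, 2.0), (35, 3.0), (61, 4.0), (65, 6.0)]

def resample_avg(readings, bucket_seconds):
    """Average values per fixed-size time bucket, keyed by bucket start."""
    buckets = defaultdict(list)
    for ts, value in readings:
        buckets[ts - ts % bucket_seconds].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}

print(resample_avg(readings, 60))
# → {0: 2.0, 60: 5.0}
```

A time-series database does this bucketing in one pass over timestamp-ordered storage; in a generic RDBMS you would reconstruct the bucket key in SQL on every query.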
We would like to be known for:
● Performance
○ Better performance with smaller machines
● Developer Experience
● Proudly Open Source (Apache 2.0)
All benchmarks are lies
Time-series specialised benchmark
https://github.com/timescale/tsbs
Ingestion benchmark results: 4.3 million records per second
https://github.com/questdb/questdb/releases/tag/7.0.1
While running queries scanning over 4 billion rows per second (16 CPU threads)
https://questdb.io/blog/2022/05/26/query-benchmark-questdb-versus-clickhouse-timescale/
QuestDB demo
https://demo.questdb.io/
https://github.com/questdb/questdb
https://questdb.io/cloud/
More info
https://questdb.io
https://demo.questdb.io
https://github.com/javier/questdb-quickstart
We 💕 contributions and ⭐ stars
github.com/questdb/questdb
THANKS!
Javier Ramirez, Head of Developer Relations at QuestDB, @supercoco9
