Brought to you by
Building efficient,
multi-threaded filters for
faster SQL queries
Vlad Ilyushchenko
Co-Founder and CTO of QuestDB
Vlad Ilyushchenko
Co-founder, CTO, QuestDB
■ Turned a side hustle into a company
■ I am interested in human psychology and high performance
computing in equal measure
■ Away from work I mostly play hide and seek with my kids
Who are we?
QuestDB is a time series database
What are we solving?
What is the problem?
Most queries for time series include filters to:
■ Find one time series in a table of multiple time series
where symbol = 'value'
■ Find outlier records in a time series
where value > 10.5
■ Find tagged records in a time series
where tag = 'value' and rate < 0.4
Fast column scan
Some benefits:
■ Single data structure - multiple queries
■ Reduces disk space usage
■ No impact on ingestion
■ Full scan algorithm scales well with “sparse” indexes
Live demo
Software components
■ JIT compilation
■ Linear memory access
■ SIMD
■ Efficient multi-core execution
■ Efficient memory management
■ SQL execution order optimization
How did we implement it?
Filter function
… where lat > 10 and lon < 100:
… where rowid in fn(lat, lon):
uint64_t[] fn(double[] lat, double[] lon, uint64_t size)
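To make the shape of the filter function concrete, here is a minimal scalar sketch in Java. It is only an illustration of the contract on the slide (columns in, matching row ids out); the class name `FilterSketch` and the use of array indexes as row ids are assumptions, and the real function is JIT-generated native code rather than Java.

```java
import java.util.Arrays;

final class FilterSketch {
    // Scalar stand-in for: where lat > 10 and lon < 100
    // Returns the ids (here: array indexes) of rows that pass both predicates.
    static long[] filter(double[] lat, double[] lon, int size) {
        long[] rowIds = new long[size];
        int n = 0;
        for (int i = 0; i < size; i++) {
            if (lat[i] > 10 && lon[i] < 100) {
                rowIds[n++] = i;
            }
        }
        return Arrays.copyOf(rowIds, n);
    }

    public static void main(String[] args) {
        double[] lat = {5.0, 12.0, 42.0, 11.0};
        double[] lon = {50.0, 120.0, 99.0, 10.0};
        System.out.println(Arrays.toString(filter(lat, lon, lat.length))); // [2, 3]
    }
}
```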
Assembling filter function
Function is assembled using AsmJit.
■ AsmJit vs LLVM
■ IR - the Intermediate Representation
● Expression is parsed to IR by Java
● IR is bytecode, lives in native memory
● IR is passed to JIT
■ AVX2 assembly
● JIT processes the IR and emits AVX2 assembly instructions
● JIT uses AsmJit for cross-platform function abstraction and register allocation
■ Combining predicates
● JIT generated code has to combine results of a > 10 and b < 3
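One way to picture predicate combination is as a bitwise AND of per-lane comparison masks. The sketch below emulates that idea in scalar Java for one group of four doubles (the number of 64-bit lanes in a 256-bit AVX2 register). The class and method names are hypothetical, and this is not the code the JIT emits.

```java
final class PredicateMaskSketch {
    static final int LANES = 4; // a 256-bit register holds four 64-bit doubles

    // Bit i of the result is set when col[offset + i] > threshold.
    static int greaterThanMask(double[] col, int offset, double threshold) {
        int mask = 0;
        for (int i = 0; i < LANES; i++) {
            if (col[offset + i] > threshold) mask |= 1 << i;
        }
        return mask;
    }

    // Bit i of the result is set when col[offset + i] < threshold.
    static int lessThanMask(double[] col, int offset, double threshold) {
        int mask = 0;
        for (int i = 0; i < LANES; i++) {
            if (col[offset + i] < threshold) mask |= 1 << i;
        }
        return mask;
    }

    // a > 10 and b < 3: a lane passes only if it is set in both masks.
    static int combine(double[] a, double[] b, int offset) {
        return greaterThanMask(a, offset, 10) & lessThanMask(b, offset, 3);
    }

    public static void main(String[] args) {
        double[] a = {11, 9, 20, 15};
        double[] b = {1, 2, 5, 2.5};
        System.out.println(Integer.toBinaryString(combine(a, b, 0))); // 1001
    }
}
```

In the example, lanes 0 and 3 satisfy both predicates, so only bits 0 and 3 survive the AND.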
Calling search function
Call search function from Java:
■ Prepare arguments, memory mapped columns of data
■ Call search function via JNI
■ Filter rows in batches benefiting from tight loop
■ Expose data via “row” API
Concurrency
Chained concurrency, queue based and share-nothing. Two goals:
■ Perform searches on multiple data chunks concurrently
■ Begin offloading search results before the search fully completes
Concurrency
The filter function is stateless*: it uses only the stack, so we can call it
from multiple threads.
■ Fork - synchronous
● Chunk the data - prepare Data Frames
● Queue the execution tasks
■ Reduce - performed on thread pool
● Call search function concurrently for multiple Data Frames
● Store row ids in reusable “arena”
■ Join - performed by caller
● Translate row ids into data
● Submit for further processing
* non-JIT filter functions can be stateful
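The fork, reduce and join steps above can be sketched with the standard `java.util.concurrent` classes. This is an assumption-laden illustration: it uses `ExecutorService` and `ArrayList` in place of the custom queues and the reusable row id arena described in this talk, and the names `ForkJoinFilterSketch`, `filterFrame` and `run` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ForkJoinFilterSketch {
    // Stateless filter over rows [from, to): it touches only its arguments
    // and locals, so concurrent calls do not interfere with each other.
    static List<Integer> filterFrame(double[] value, int from, int to, double threshold) {
        List<Integer> rowIds = new ArrayList<>();
        for (int i = from; i < to; i++) {
            if (value[i] > threshold) rowIds.add(i);
        }
        return rowIds;
    }

    static List<Double> run(double[] value, int frameSize, double threshold, int workers) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            // Fork (synchronous): chunk the data and queue one task per frame.
            // Reduce: the pool runs filterFrame concurrently for the queued frames.
            List<Future<List<Integer>>> tasks = new ArrayList<>();
            for (int lo = 0; lo < value.length; lo += frameSize) {
                final int from = lo;
                final int to = Math.min(lo + frameSize, value.length);
                tasks.add(pool.submit(() -> filterFrame(value, from, to, threshold)));
            }
            // Join (caller): take frames in order and translate row ids into data.
            List<Double> out = new ArrayList<>();
            for (Future<List<Integer>> task : tasks) {
                for (int rowId : task.get()) out.add(value[rowId]);
            }
            return out;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        double[] value = {1, 20, 3, 40, 50, 6, 70};
        System.out.println(run(value, 3, 10, 2)); // [20.0, 40.0, 50.0, 70.0]
    }
}
```

Because the caller consumes frames in submission order, it can start processing the first frame's results while later frames are still being filtered.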
Concurrency model
Single Writer Principle
Single writer per table, Multi-Version Concurrency Control.
■ Append-only table versions
■ Transaction commit bumps tx watermark
■ Order by time before commit
■ Late data triggers new version creation
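The watermark idea can be illustrated with a single appender and an atomically published row count. This is a simplified sketch under stated assumptions: one in-memory column, a hypothetical class `TxWatermarkSketch`, and no handling of time ordering or late data, which would create a new table version.

```java
import java.util.concurrent.atomic.AtomicLong;

final class TxWatermarkSketch {
    private final double[] column;
    private int appended;                                  // touched by the single writer only
    private final AtomicLong committed = new AtomicLong(); // rows visible to readers

    TxWatermarkSketch(int capacity) {
        column = new double[capacity];
    }

    // Writer: append a row; it is not visible to readers yet.
    void append(double v) {
        column[appended++] = v;
    }

    // Writer: commit bumps the watermark, publishing all appended rows.
    void commit() {
        committed.set(appended);
    }

    // Reader: scans only the rows below the watermark it observed.
    double sumCommitted() {
        long visible = committed.get();
        double sum = 0;
        for (int i = 0; i < visible; i++) sum += column[i];
        return sum;
    }

    public static void main(String[] args) {
        TxWatermarkSketch table = new TxWatermarkSketch(8);
        table.append(1);
        table.append(2);
        table.commit();
        table.append(3);                          // uncommitted
        System.out.println(table.sumCommitted()); // 3.0
        table.commit();
        System.out.println(table.sumCommitted()); // 6.0
    }
}
```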
Concurrency components
Threading model
Inter-thread messaging framework
MPSequence pubSeq = new MPSequence();
MCSequence subSeq = new MCSequence();
pubSeq.then(subSeq).then(pubSeq);
// publisher thread
pubSeq.next();
// consumer thread
subSeq.next();
(Diagram: Pub Sequence and Sub Sequence)
Work stealing
Publisher becomes consumer when queue is
full. Consuming behaviour can be adjusted to
prioritise publishing.
■ Publisher does not waste time
■ Pub/Sub system can work on 1 thread
■ Consumer thread does other things when
queue is empty
long cur = pubSeq.next();
if (cur > -1) {
    // publish
} else {
    cur = subSeq.next();
    if (cur > -1) {
        // consume
    }
}
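The same publish-or-consume pattern can be tried out with a standard bounded queue. The sketch below swaps the sequences for `java.util.concurrent.ArrayBlockingQueue`, which is an assumption made purely for illustration, and shows that a single thread can drive both sides: when `offer` fails because the queue is full, the publisher consumes instead. The class name `PublishOrConsumeSketch` is hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;

final class PublishOrConsumeSketch {
    // Pushes `count` tasks through a bounded queue using one thread.
    // Returns the number of tasks consumed.
    static int run(int capacity, int count) {
        ArrayBlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
        int published = 0;
        int consumed = 0;
        while (consumed < count) {
            if (published < count && queue.offer(published)) {
                published++; // publish succeeded
            } else {
                // Queue is full (or nothing is left to publish): consume instead.
                Integer task = queue.poll();
                if (task != null) consumed++;
            }
        }
        return consumed;
    }

    public static void main(String[] args) {
        System.out.println(run(2, 5)); // 5
    }
}
```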
Circuit breaker
■ We can interrupt long running execution
● On timeout
● On connection drop
● If we detect we have done enough work (LIMIT N / LIMIT -N)
■ The circuit breaker is atomically executed code injected into every queue
slot
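A circuit breaker of this kind can be approximated as a cheap check that runs before each queued task. The sketch below is a simplified assumption, not the actual implementation: the class `CircuitBreakerSketch`, the per-frame check, and the way the limit is modelled are all hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class CircuitBreakerSketch {
    private final AtomicBoolean cancelled = new AtomicBoolean();
    private final long deadlineNanos;

    CircuitBreakerSketch(long timeoutNanos) {
        deadlineNanos = System.nanoTime() + timeoutNanos;
    }

    // Called on connection drop, or once enough work has been done.
    void cancel() {
        cancelled.set(true);
    }

    boolean tripped() {
        return cancelled.get() || System.nanoTime() - deadlineNanos > 0;
    }

    // Processes frames until all are done or the breaker trips.
    // Returns the number of frames processed.
    static int processFrames(int frames, int limit, CircuitBreakerSketch breaker) {
        int done = 0;
        for (int f = 0; f < frames; f++) {
            if (breaker.tripped()) break;        // checked once per queued task
            done++;
            if (done == limit) breaker.cancel(); // enough work done
        }
        return done;
    }

    public static void main(String[] args) {
        CircuitBreakerSketch breaker = new CircuitBreakerSketch(3_600_000_000_000L); // one hour
        System.out.println(processFrames(10, 3, breaker)); // 3
    }
}
```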
Other nuances
We employ a number of techniques to reduce wait, contention and memory
consumption:
■ Fixed worker pool
■ Tagged fork-join sequences
■ Sharded bounded queues
■ Reusable row id “arena”
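As an illustration of the last item, a reusable row id arena can be as simple as a growable `long[]` that is reset rather than reallocated between tasks. The class below is a hypothetical sketch of that idea, not the data structure used in QuestDB.

```java
import java.util.Arrays;

final class RowIdArenaSketch {
    private long[] ids;
    private int size;

    RowIdArenaSketch(int initialCapacity) {
        ids = new long[initialCapacity];
    }

    void add(long rowId) {
        if (size == ids.length) {
            ids = Arrays.copyOf(ids, Math.max(1, size * 2)); // grow only when needed
        }
        ids[size++] = rowId;
    }

    long get(int index) {
        return ids[index];
    }

    int size() {
        return size;
    }

    int capacity() {
        return ids.length;
    }

    // Reset for the next task; the backing array is kept, so no new allocation.
    void clear() {
        size = 0;
    }

    public static void main(String[] args) {
        RowIdArenaSketch arena = new RowIdArenaSketch(2);
        arena.add(7);
        arena.add(8);
        arena.add(9);
        System.out.println(arena.size() + " " + arena.capacity()); // 3 4
        arena.clear();
        System.out.println(arena.size() + " " + arena.capacity()); // 0 4
    }
}
```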
What did we learn?
Conclusion
■ Pros:
● Nice performance gains
● Faster than an index when the index is larger than the column scanned
● Full scan works well with sparse indexes
■ Cons:
● Implementation is quite complex and indirect
● It was hard to find all of the many race conditions
● Lots of “fuzz” tests
Further Resources
■ Live demo: https://demo.questdb.io/
■ Blog: How we built a SIMD JIT compiler for SQL in QuestDB
■ Blog: 4Bn rows/sec query benchmark: Clickhouse vs QuestDB vs Timescale
■ Blog: How we built inter-thread messaging from scratch
■ Blog: My journey making QuestDB
We 💕 All Contributions
github.com/questdb/questdb
Thank you!
Q&A
Vlad Ilyushchenko
vlad@questdb.io
@ilyushvl
